Predicting The Price of A Used Car Using A Regression Model With Python
# Raw dataframe output: 205 rows x 26 columns. The source file has no header row, so the
# columns are labeled 0-25, and missing values are recorded as "?" (e.g. normalized-losses in column 1).
[205 rows x 26 columns]
# df.head() output: the first 5 rows of the raw data, still with numeric column labels 0-25.
[5 rows x 26 columns]
headers = ["symboling","normalized-losses","make","fuel-type","aspiration",
"num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base",
"length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-
system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)
headers
['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors',
 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width',
 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system',
 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
 'highway-mpg', 'price']
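The cell that attaches these names to the dataframe is not shown above; presumably something like the following, so that later cells can refer to columns by name:
df.columns = headers
df.head()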
# Basic Analysis
df.shape
# The df has 205 records with 26 attributes
(205, 26)
# describe() only summarizes numeric columns; info() lists every column's dtype and non-null count
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null object
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 205 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 205 non-null object
19 stroke 205 non-null object
20 compression-ratio 205 non-null float64
21 horsepower 205 non-null object
22 peak-rpm 205 non-null object
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
# Replace "?" with NaN so pandas can detect the missing values
df.replace("?", np.nan, inplace=True)
missing_data = df.isnull()

# Count missing values in each column
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
Based on the summary above, each column has 205 rows of data, and seven columns contain missing values:
"normalized-losses": 41 missing, "num-of-doors": 2 missing, "bore": 4 missing, "stroke": 4 missing, "horsepower": 2 missing, "peak-rpm": 2 missing, "price": 4 missing.
We do not want to impute the value of price, since it is the dependent variable, but we can drop the records where it is null.
How to handle the remaining missing data:
a. replace it by the mean - for numeric data
b. replace it by the most frequent value - for categorical data
c. replace it based on other functions
Replace by mean: normalized-losses (41), stroke (4), bore (4), horsepower (2), peak-rpm (2). Replace by frequency: num-of-doors (2).
# Calculate the average of the column, then replace NaN with it
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)
avg_stroke = df["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", avg_stroke)
df["stroke"].replace(np.nan, avg_stroke, inplace=True)
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
df["bore"].replace(np.nan, avg_bore, inplace=True)
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)
avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)
# See which value is most common for num-of-doors
df['num-of-doors'].value_counts()
four    114
two      89
Name: num-of-doors, dtype: int64
# We can also use the ".idxmax()" method to find the most common type automatically:
df['num-of-doors'].value_counts().idxmax()
'four'
#replace the missing 'num-of-doors' values by the most frequent
df["num-of-doors"].replace(np.nan, "four", inplace=True)
#Finally, let's drop all rows that do not have price data:
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)
# axis=0 for row, axis =1 for dropping the column
df.head(7)
# Now we got data with no missing values
# (only the last two of the 26 columns are shown)
   highway-mpg  price
0           27  13495
1           27  16500
2           26  16500
3           30  13950
4           22  17450
5           25  15250
6           25  17710
[7 rows x 26 columns]
Data Formatting
Correct data format: in Pandas, we use
.dtypes to check the data type
.astype() to change the data type
# First see the data types
df.dtypes
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
As we can see above, some columns are not of the correct data type. Numerical variables
should have type 'float' or 'int', and variables with strings such as categories should have
type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe
the engines, so we should expect them to be of the type 'float' or 'int'; however, they are
shown as type 'object'. We have to convert data types into a proper format for each column
using the "astype()" method.
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
df.dtypes
# Now everything is in correct format
symboling int64
normalized-losses int32
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower object
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
dtype: object
df.head()
[5 rows x 26 columns]
Data Standardization
Data is usually collected from different agencies in different formats. Standardization is the process of transforming data into a common format, which allows meaningful comparisons. Example: mapping "NY" and "new york" both to "NY". In this case we will transform mpg to L/100km. The formula for the unit conversion is:
L/100km = 235 / mpg
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]
[5 rows x 27 columns]
Data Normalization
Why normalization?
Normalization is the process of transforming values of several variables into a similar
range. Typical normalizations include scaling the variable so the variable average is 0,
scaling the variable so the variance is 1, or scaling variable so the variable values range
from 0 to 1
Example
To demonstrate normalization, let's say we want to scale the columns "length", "width" and
"height"
Target: we would like to normalize those variables so their values range from 0 to 1.
Approach: replace the original value by (original value)/(maximum value).
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
Binning
Why binning?
Binning is a process of transforming continuous numerical variables into discrete
categorical 'bins', for grouped analysis. In our dataset, "horsepower" is a real valued
variable ranging from 48 to 288, it has 57 unique values. What if we only care about the
price difference between cars with high horsepower, medium horsepower, and little
horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?
We will use the Pandas method 'cut' to segment the 'horsepower' column into 3 bins
# Since we are binning hp we have to convert the type to int from object
df["horsepower"]=df["horsepower"].astype(int, copy=True)
# Plot a histogram of horsepower to see its distribution
plt.pyplot.hist(df["horsepower"])
(array([44., 45., 48., 24., 14., 16., 5., 4., 0., 1.]),
 array([ 48. , 69.4, 90.8, 112.2, 133.6, 155. , 176.4, 197.8, 219.2, 240.6, 262. ]),
 <BarContainer object of 10 artists>)
# But we are missing the labeling
# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
#
plt.pyplot.hist(df["horsepower"])
(array([44., 45., 48., 24., 14., 16., 5., 4., 0., 1.]),
array([ 48. , 69.4, 90.8, 112.2, 133.6, 155. , 176.4, 197.8, 219.2,
240.6, 262. ]),
<BarContainer object of 10 artists>)
# We want 3 equal-width bins for the binning
# We would like 3 bins of equal bandwidth, so we use numpy's
# linspace(start_value, end_value, numbers_generated) function.
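The cell that actually creates the bins is not preserved above; a minimal sketch, assuming the usual np.linspace / pd.cut approach with illustrative bin labels:
# 4 evenly spaced cut points give 3 equal-width bins; pd.cut assigns each car to a bin
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']   # hypothetical label names
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
df[['horsepower', 'horsepower-binned']].head()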
fuel-type-diesel fuel-type-gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
aspiration-std aspiration-turbo
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
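The one-hot encoding cells that produced the two tables above are not shown; a minimal sketch using pd.get_dummies (the column names match the tables above; the merge/drop details are assumptions):
# Indicator (dummy) variables for fuel-type
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.rename(columns={'gas': 'fuel-type-gas', 'diesel': 'fuel-type-diesel'}, inplace=True)

# Indicator variables for aspiration
dummy_variable_2 = pd.get_dummies(df["aspiration"])
dummy_variable_2.rename(columns={'std': 'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# Merge the new columns into df and drop the original categorical columns (assumed step)
df = pd.concat([df, dummy_variable_1, dummy_variable_2], axis=1)
df.drop(["fuel-type", "aspiration"], axis=1, inplace=True)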
df.head()
[5 rows x 29 columns]
df.columns

df.describe(include='all')
# Output truncated to the last column:
        aspiration-turbo
count         201.000000
unique               NaN
top                  NaN
freq                 NaN
mean            0.179104
std             0.384397
min             0.000000
25%             0.000000
50%             0.000000
75%             0.000000
max             1.000000
df.describe()
# Summary of numeric variables
aspiration-turbo
count 201.000000
mean 0.179104
std 0.384397
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 21 columns]
df.describe(include=['object'])
#Summary of categorial variables
          make num-of-doors body-style drive-wheels engine-location
count      201          201        201          201             201
unique      22            2          5            3               2
top     toyota         four      sedan          fwd           front
freq        32          115         94          118             198
Value Counts
Value counts is a good way of understanding how many units of each characteristic/variable we have. Don't forget that the method "value_counts" only works on a Pandas Series, not a Pandas DataFrame. As a result, we only include one bracket, df['drive-wheels'], not two brackets, df[['drive-wheels']].
df['drive-wheels'].value_counts()
fwd 118
rwd 75
4wd 8
Name: drive-wheels, dtype: int64
df['drive-wheels'].value_counts().to_frame()
drive-wheels
fwd 118
rwd 75
4wd 8
# Let's repeat the above steps, but save the results to the dataframe "drive_wheels_counts"
# and rename the column 'drive-wheels' to 'value_counts'.
# drive_wheels_counts as variable
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts.head(10)
value_counts
drive-wheels
fwd 118
rwd 75
4wd 8
Drive-wheels could be a useful predictor, since its categories are fairly evenly distributed, except for 4wd.
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
value_counts
engine-location
front 198
rear 3
Examining the value counts, engine location would not be a good predictor of price: we have only three cars with a rear engine and 198 with a front engine, so the result is heavily skewed. Thus, we are not able to draw any conclusions about engine location.
df.dtypes
symboling int64
normalized-losses int32
make object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore float64
stroke float64
compression-ratio float64
horsepower int32
peak-rpm float64
city-mpg int64
highway-mpg int64
price float64
city-L/100km float64
fuel-type-diesel uint8
fuel-type-gas uint8
aspiration-std uint8
aspiration-turbo uint8
dtype: object
df[['engine-size', 'price']].corr()
             engine-size     price
engine-size     1.000000  0.872335
price           0.872335  1.000000
The correlation between price and engine-size is 0.87, which signifies a strong positive linear relationship between them: as engine size increases, price tends to increase as well.
# Let's look at the scatterplot of "engine-size" and "price" with a fitted regression line
import seaborn as sns
sns.regplot(x="engine-size", y="price", data=df)
<AxesSubplot:xlabel='engine-size', ylabel='price'>
Let's look at some more relationships with price, such as highway mileage, peak rpm, and stroke.
df[['highway-mpg', 'price']].corr()
# As the highway-mpg goes up, the price goes down:
#this indicates an inverse/negative relationship between these two variables.
#Highway mpg could potentially be a predictor of price.
highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000
The correlation between 'highway-mpg' and 'price' is approximately -0.704.
sns.regplot(x="highway-mpg", y="price", data=df)
<AxesSubplot:xlabel='highway-mpg', ylabel='price'>
df[['peak-rpm','price']].corr()
peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000
The correlation between 'peak-rpm' and 'price' is approximately -0.101616.
sns.regplot(x="peak-rpm", y="price", data=df)
<AxesSubplot:xlabel='peak-rpm', ylabel='price'>
df[["stroke","price"]].corr()
stroke price
stroke 1.000000 0.082269
price 0.082269 1.000000
sns.regplot(x="stroke", y="price", data=df)
<AxesSubplot:xlabel='stroke', ylabel='price'>
The correlation between 'stroke' and 'price' is only about 0.082, so stroke is a weak predictor of price.
P-value:
The p-value is the probability of observing a correlation this strong if there were actually no relationship between the two variables. Normally, we choose a significance level of 0.05, which means we are 95% confident that the correlation is statistically significant when the p-value is below 0.05.
By convention, when the
p-value is < 0.001: we say there is strong evidence that the correlation is significant.
the p-value is < 0.05: there is moderate evidence that the correlation is significant.
the p-value is < 0.1: there is weak evidence that the correlation is significant.
the p-value is > 0.1: there is no evidence that the correlation is significant.
# We can obtain this information using "stats" module in the "scipy" library.
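The cell that computes this for wheel-base is not shown; a minimal sketch using scipy.stats.pearsonr, which returns both the correlation coefficient and the p-value discussed in the conclusion below:
from scipy import stats

# Pearson correlation coefficient and p-value for wheel-base vs. price
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)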
wheel-base price
wheel-base 1.000000 0.584642
price 0.584642 1.000000
The correlation is 0.58, which indicates a moderate positive relationship between wheel-base and price.
# Visualization
sns.regplot(x="wheel-base", y="price", data=df)
<AxesSubplot:xlabel='wheel-base', ylabel='price'>
Conclusion: Since the p-value is < 0.001, the correlation between wheel-base and price is
statistically significant, although the linear relationship isn't extremely strong (~0.585)
# Horsepower vs Price
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value)
Since the p-value is < 0.001, the correlation between horsepower and price is statistically
significant, and the linear relationship is quite strong (~0.809, close to 1)
# Length vs Price
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value)
Since the p-value is < 0.001, the correlation between length and price is statistically
significant, and the linear relationship is moderately strong (~0.691).
# Width vs Price
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P =", p_value )
Since the p-value is < 0.001, the correlation between width and price is statistically
significant, and the linear relationship is quite strong (~0.751).
# Curb-weight vs Price
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value)
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically
significant, and the linear relationship is quite strong (~0.834).
# Engine-size vs Price
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P =", p_value)
Since the p-value is < 0.001, the correlation between engine-size and price is statistically
significant, and the linear relationship is very strong (~0.872).
# Bore vs Price
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value )
Since the p-value is <0.001, the correlation between bore and price is statistically
significant, but the linear relationship is only moderate (~0.521).
# City-mpg vs Price
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value)
Since the p-value is < 0.001, the correlation between city-mpg and price is statistically
significant, and the coefficient of ~ -0.687 shows that the relationship is negative and
moderately strong.
# Highway-mpg vs Price
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-
value of P = ", p_value )
Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically
significant, and the coefficient of ~ -0.705 shows that the relationship is negative and
moderately strong.
# Let's look at the relationship between "body-style" and "price"
sns.boxplot(x="body-style", y="price", data=df)
<AxesSubplot:xlabel='body-style', ylabel='price'>
We see that the distributions of price across the different body-style categories overlap significantly, so body-style would not be a good predictor of price.
#Let's examine engine "engine-location" and "price":
sns.boxplot(x="engine-location", y="price", data=df)
<AxesSubplot:xlabel='engine-location', ylabel='price'>
Here we see that the distribution of price between these two engine-location categories,
front and rear, are distinct enough to take engine-location as a potential good predictor of
price.
# Let's examine "drive-wheels" and "price"
sns.boxplot(x="drive-wheels", y="price", data=df)
<AxesSubplot:xlabel='drive-wheels', ylabel='price'>
Here we see that the distribution of price between the different drive-wheels categories
differs; as such drive-wheels could potentially be a predictor of price.
Basics of Grouping
The "groupby" method groups data by different categories. The data is grouped based on
one or several variables and analysis is performed on the individual groups.
For example, let's group by the variable "drive-wheels". We see that there are 3 different
categories of drive wheels.
df['drive-wheels'].unique()
#If we want to know, on average, which type of drive wheel is most valuable,
# we can group "drive-wheels" and then average them.
# grouping results
df_group_one = df[['drive-wheels','body-style','price']]
# Here We selected the columns 'drive-wheels', 'body-style' and 'price',
# then assign it to the variable "df_group_one"
# grouping results (drive wheels only)
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
df_group_one
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333
From our data, it seems rear-wheel drive vehicles are, on average, the most expensive,
while 4-wheel and front-wheel are approximately the same in price.
# grouping results (body style only)
# Note: re-grouping df_group_one does not work here, because it has already been
# aggregated by drive-wheels above; instead we start again from the original dataframe.
#grouping results
df_gptest2 = df[['body-style','price']]
grouped_test_bodystyle = df_gptest2.groupby(['body-style'],as_index=
False).mean()
grouped_test_bodystyle
body-style price
0 convertible 21890.500000
1 hardtop 22208.500000
2 hatchback 9957.441176
3 sedan 14459.755319
4 wagon 12371.960000
# grouping results
df_gptest = df[['drive-wheels', 'body-style', 'price']]
grouped_test1 = df_gptest.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
grouped_test1
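The long-format groupby result is then reshaped into a pivot table with drive-wheels as rows and body-style as columns; that cell is not shown, so here is a minimal sketch consistent with the table below and with the fillna step that follows:
# Reshape: drive-wheels as the index, body-style as the columns
grouped_pivot = grouped_test1.pivot(index='drive-wheels', columns='body-style')
grouped_pivot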
                    price
body-style    convertible       hardtop     hatchback         sedan         wagon
drive-wheels
4wd                   NaN           NaN   7603.000000  12647.333333   9095.750000
fwd               11595.0   8249.000000   8396.387755   9811.800000   9997.333333
rwd               23949.6  24202.714286  14337.777778  21711.833333  16994.222222
Often, we won't have data for some of the pivot cells. We can fill these missing cells with the
value 0, but any other value could potentially be used as well. It should be mentioned that
missing data is quite a complex subject and is an entire course on its own
#fill missing values with 0
grouped_pivot = grouped_pivot.fillna(0)
grouped_pivot
                    price
body-style    convertible       hardtop     hatchback         sedan         wagon
drive-wheels
4wd                   0.0      0.000000   7603.000000  12647.333333   9095.750000
fwd               11595.0   8249.000000   8396.387755   9811.800000   9997.333333
rwd               23949.6  24202.714286  14337.777778  21711.833333  16994.222222
Heat Map
The heatmap represents the target variable (price) as colour intensity, with 'drive-wheels' and 'body-style' on the vertical and horizontal axes respectively. This lets us visualize how price is related to 'drive-wheels' and 'body-style'.
import matplotlib.pyplot as plt
%matplotlib inline
#use the grouped results
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
#The default labels convey no useful information to us. Let's change that:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move the ticks to the center of each cell so the labels line up
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

fig.colorbar(im)
plt.show()
Categorical variables: drive-wheels
As we now move into building machine learning models to automate our analysis, feeding the model variables that meaningfully affect the target variable will improve its prediction performance.
Model Development
In this section, we will develop several models that will predict the price of the car using
the variables or features. This is just an estimate but should give us an objective idea of
how much the car should cost.
A model (estimator) is a mathematical equation used to predict a value from one or more other values.
Some questions we want to ask in this module:
Do I know if the dealer is offering fair value for my trade-in?
Do I know if I put a fair value on my car?
In data analytics, we often use model development to help us predict future observations from the data we have.
A Model will help us understand the exact relationship between different variables and
how these variables are used to predict the result. There are many ways to
estimate/predict the outcome. Here we will use
1. Linear regression- simple/multiple
2. Polynomial regression and pipelines
3. Model using decision tree
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Linear Regression
One example of a Data Model that we will be using is Simple Linear Regression.
Simple Linear Regression is a method to help us understand the relationship between two
variables:
The predictor/independent variable (X)
The response/dependent variable (that we want to predict)(Y)
The result of Linear Regression is a linear function that predicts the response (dependent)
variable as a function of the predictor (independent) variable.
Y: the dependent (response) variable; X: the independent (predictor) variable
Linear function: Yhat = a + b*X
a refers to the intercept of the regression line, in other words: the value of Y when X is 0
b refers to the slope of the regression line, in other words: the amount by which Y changes when X increases by 1 unit
# Steps
# Set of points-->fit the model-->parameters--->regress--->estimator
# import library
from sklearn.linear_model import LinearRegression
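The cell that creates the simple linear regression object and fits it on highway-mpg is not shown; a minimal sketch consistent with the intercept and coefficient printed below (the X and Y names match the later R-squared cell):
# Create the linear regression object and fit price on highway-mpg
lm = LinearRegression()

X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)

# Intercept (a) of the fitted line
lm.intercept_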
38423.305858157386
lm.coef_
array([-821.73337832])
The estimated linear model is: price = 38423.31 - 821.73 * highway-mpg. This relationship signifies:
1. The predicted price is 38423 when highway-mpg is 0 (the intercept).
2. The predicted price decreases by about 821 for every 1-unit increase in highway-mpg.
# Let us practice few more models
# Price and engine size
lm1 = LinearRegression()
lm1
X = df[['engine-size']]
Y = df['price']
lm1.fit(X,Y)
# Slope
lm1.coef_
# Intercept
lm1.intercept_
print ( "The slope of the model is", lm1.coef_, " with intercept ",
lm1.intercept_ )
# Visualize the relationship between highway-mpg and price with a regression plot
width = 12
height = 10   # figure size (assumed; the original defining cell is not shown)
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
(0.0, 48170.301416875096)
We can see from this plot that price is negatively correlated with highway-mpg, since the regression slope is negative. One thing to keep in mind when looking at a regression plot is how scattered the data points are around the regression line. This gives a good indication of the variance of the data and whether a linear model is a good fit. If the data points are too far from the line, a linear model might not be the best model for this data. In this case, the points are quite scattered around 25 mpg and there are many outliers as well.
# Pracice 2
# Let's compare this plot to the regression plot of "peak-rpm".
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
(0.0, 47414.1)
Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the fitted line and show a clear decrease on average. The points for "peak-rpm" are more spread around the predicted line, and it is much harder to determine whether the points are increasing or decreasing as "peak-rpm" increases.
The variable "highway-mpg" has a stronger correlation with "price": approximately -0.704692, compared to approximately -0.101616 for "peak-rpm". You can verify this with the following command:
df[["peak-rpm","highway-mpg","price"]].corr()
• We can see from this residual plot that the residuals are not randomly spread
around the x-axis, which leads us to believe that maybe a non-linear model is more
appropriate for this data.
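The residual-plot cell itself is not preserved; a minimal sketch of the plot described in the bullet above, using seaborn's residplot:
import seaborn as sns

plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()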
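The comparison plot referenced below is not shown; a minimal sketch that overlays the actual prices and a model's fitted prices (it assumes a multiple linear regression has been fitted on a predictor matrix Z in a cell not shown here, and uses kdeplot in place of the deprecated distplot):
plt.figure(figsize=(width, height))

Yhat_fit = lm.predict(Z)  # fitted prices (Z and the multi-variable fit itself are assumptions)

ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Yhat_fit, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.legend()
plt.show()
plt.close()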
We can see that the fitted values are reasonably close to the actual values, since the two
distributions overlap a bit. However, there is definitely some room for improvement.
Polynomial Regression and Pipelines
• Polynomial regression is a particular case of the general linear regression model or
multiple linear regression models.
We get non-linear relationships by squaring or setting higher-order terms of the predictor
variables.
There are different orders of polynomial regression:
Quadratic - 2nd order: Yhat = a + b1*X + b2*X^2
Cubic - 3rd order: Yhat = a + b1*X + b2*X^2 + b3*X^3
Higher order: Y = a + b1*X + b2*X^2 + b3*X^3 + ...
We saw earlier that a linear model did not provide the best fit while using highway-mpg as
the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.
# We will use the following function, PlotPolly, to plot the data:
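The body of this plotting helper is not preserved in the text above; a minimal sketch of a PlotPolly function that matches how it is called below (the plotting range 15-55 mpg is an assumption):
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # model: a numpy poly1d object; Name: label for the x-axis
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit for Price ~ ' + Name)
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()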
# Fit a 3rd-order polynomial to price as a function of highway-mpg
x = df['highway-mpg']
y = df['price']
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
        3         2
-1.557 x + 204.8 x - 8965 x + 1.379e+05
# Let's plot the function
PlotPolly(p, x, y, 'highway-mpg')
We can already see from plotting that this polynomial model performs better than the
linear model. This is because the generated polynomial function "hits" more of the data
points.
# Let us check for 5th degree
f = np.polyfit(x, y, 5)
p = np.poly1d(f)
PlotPolly(p, x, y, 'highway-mpg')
# Here we can see 5th order fits the data more
# Check for 11th order
f = np.polyfit(x, y, 11)
p = np.poly1d(f)
PlotPolly(p, x, y, 'highway-mpg')
The higher the order, the better the curve fits the training data, but the model can become overfitted (or, at too low an order, underfitted). We will discuss this issue later.
Z.shape # the original data has 201 samples with three features
(201, 3)
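The degree-2 polynomial transform of these three features is what yields the (201, 10) shape printed next; that cell is not shown, so here is a minimal sketch:
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features of Z; with the bias term, 3 inputs expand to 10 outputs
pr = PolynomialFeatures(degree=2)
Z_pr = pr.fit_transform(Z)
Z_pr.shape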
(201, 10)
Data Pipeline
Data pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# We create the pipeline by creating a list of tuples, each containing the name of the
# model or estimator and its corresponding constructor.
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(include_bias=False)),
         ('model', LinearRegression())]

# We input the list as an argument to the pipeline constructor
pipe = Pipeline(Input)
pipe

# We can normalize the data, perform the transform and fit the model simultaneously.
y = df['price']   # target vector (assumed definition; the original cell is not shown)
pipe.fit(Z, y)

# Similarly, we can normalize the data, perform the transform and produce a prediction simultaneously
ypipe = pipe.predict(Z)
ypipe[0:4]
• So we first created the pipeline, then fit the model on the scaled and transformed data, and finally produced predictions.
# Let us practice with the first 10 predicted values of the model
Input=[('scale',StandardScaler()),('model',LinearRegression())]
pipe=Pipeline(Input)
pipe.fit(Z,y)
ypipe=pipe.predict(Z)
ypipe[0:10]
• R-squared
R squared, also known as the coefficient of determination, is a measure to indicate how
close the data is to the fitted regression line.
The value of the R-squared is the percentage of variation of the response variable (y) that is
explained by a linear model.
• Mean Squared Error (MSE)
The Mean Squared Error measures the average of the squares of errors, that is, the
difference between actual value (y) and the estimated value (ŷ).
• Measures for simple linear regression using highway-mpg
# R^2
#highway_mpg_fit
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
We can say that ~49.659% of the variation in price is explained by this simple linear model on "highway-mpg". An R-squared close to 1 indicates a good fit.
# MSE
# Import Libraries
from sklearn.metrics import mean_squared_error

# Predictions from the simple linear model fitted above
Yhat = lm.predict(X)

mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
The mean square error of price and predicted value is: 31635042.944639895
We can say that ~ 80.896 % of the variation of price is explained by this multiple linear
regression "multi_fit".
# Let's calculate the MSE
Y_predict_multifit = lm.predict(Z)
print('The mean square error of price and predicted value using multifit is: ',
      mean_squared_error(df['price'], Y_predict_multifit))
The mean square error of price and predicted value using multifit is: 12786870.010381931
We can say that ~ 67.419 % of the variation of price is explained by this polynomial fit
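The cell that computes the polynomial model's R² is not shown; a minimal sketch using sklearn's r2_score on the cubic fit (refit here in case p was overwritten by the higher-order experiments above):
from sklearn.metrics import r2_score

p = np.poly1d(np.polyfit(x, y, 3))
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)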
#We can also calculate the MSE:
mean_squared_error(df['price'], p(x))
18703127.639051963
%matplotlib inline
# Produce a prediction for new inputs
# (new_input is defined in a cell not shown here; it would be an array of predictor values
#  with the same columns the model was fitted on)
yhat = lm.predict(new_input)
yhat[0:5]
When comparing models, the model with the higher R-squared value is a better fit for the
data.
What is a good MSE?
When comparing models, the model with the smallest MSE value is a better fit for the data.
Let's take a look at the values for the different models.
Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.
R-squared: 0.49659118843391759
MSE: 3.16 x10^7
Simple Linear Regression model (SLR) vs Multiple Linear Regression model (MLR)
Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful or may even act as noise. As a result, you should always check the MSE and R^2.
So to be able to compare the results of the MLR vs SLR models, we look at a combination of
both the R-squared and MSE to make the best conclusion about the fit of the model.
MSE : The MSE of SLR is 3.16x10^7 while MLR has an MSE of 1.2 x10^7. The MSE
of MLR is much smaller.
R-squared: In this case, we can also see that there is a big difference
between the R-squared of the SLR and the R-squared of the MLR. The R-squared
for the SLR (~0.497) is very small compared to the R-squared for the MLR
(~0.809).
This R-squared, in combination with the MSE, shows that MLR seems like the better model fit in this case, compared to SLR.
Simple Linear Regression (SLR) vs Polynomial Fit
MSE: We can see that Polynomial Fit brought down the MSE, since this MSE is
smaller than the one from the SLR.
R-squared: The R-squared for the Polyfit is larger than the R-squared for the
SLR, so the Polynomial Fit also brought up the R-squared quite a bit.
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that it was a better model than the simple linear regression for predicting price with highway-mpg as the predictor variable.
Multiple Linear Regression (MLR) vs Polynomial Fit
MSE: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
R-squared: The R-squared for the MLR is also much larger than for the
Polynomial Fit.
• Comparing these three models, we conclude that the MLR model is the best model to
be able to predict price from our dataset. This result makes sense, since we have 27
variables in total, and we know that more than one of those variables are potential
predictors of the final car price.
%%capture
! pip install ipywidgets
# For interactive interface
from ipywidgets import interact, interactive, fixed, interact_manual
# Distribution-comparison helper; only its title/label/show lines survive above, so the
# signature and the two distplot calls below are reconstructed (assumed), not original.
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)
    sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)
    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    plt.show()
    plt.close()
# Define PollyPlot: plot training data, testing data, and the predicted function
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # xtrain, xtest: training / testing data
    # lr: linear regression object
    # poly_transform: polynomial transformation object
    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])

    # the plotting body below is reconstructed from the arguments above (assumed)
    x = np.arange(xmin, xmax, 0.1)
    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))),
             label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
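The cells that build the feature/target data and perform the first train/test split are not preserved above; a minimal sketch consistent with the three outputs printed next (the exact test_size and random_state used are not shown, 0.10 and 1 are assumptions):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

y_data = df['price']                 # assumed target definition
x_data = df.drop('price', axis=1)    # assumed feature frame

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

lre = LinearRegression()
lre.fit(x_train[['highway-mpg']], y_train)    # -> LinearRegression()
lre.score(x_test[['highway-mpg']], y_test)    # R^2 on the test split
lre.score(x_train[['highway-mpg']], y_train)  # R^2 on the training split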
LinearRegression()
-0.02688896960469922
0.5083627320666815
• We can see the R^2 is much smaller on the test data than on the training data.
# Let us use 40 percent data as test data
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.4, random_state=0)
lre.fit(x_train1[['highway-mpg']],y_train1)
lre.score(x_test1[['highway-mpg']],y_test1)
0.48301138227946017
lre.score(x_train1[['highway-mpg']], y_train1)
0.4934016641097354
Cross-validation Score
Sometimes you do not have sufficient testing data; as a result, you may want to perform
Cross-validation. Let's go over several methods that you can use for Cross-validation.
#Lets import model_selection from the module cross_val_score
from sklearn.model_selection import cross_val_score
# We input the object, the feature ('highway-mpg'), and the target data (y_data).
# The parameter 'cv' determines the number of folds; in this case 4.
Rcross = cross_val_score(lre, x_data[['highway-mpg']], y_data, cv=4)

# The default scoring is R^2; each element in the array is the R^2 value for one fold:
Rcross
The mean of the folds is 0.3881662072564622 and the standard deviation is 0.1697268682335954
0.3689378324027451
• You can also use the function 'cross_val_predict' to predict the output. The function
splits up the data into the specified number of folds, using one fold for testing and
the other folds are used for training.
# import the function
from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lre,x_data[['highway-mpg']], y_data,cv=4)
yhat[0:5]
# Let's examine the distribution of the predicted values of the training data
plt.figure(figsize=(width, height))
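The cell that fits a model on the training split and draws the comparison is not shown; a minimal sketch, assuming a multiple linear regression on four features (the feature set is an assumption) and the DistributionPlot helper sketched above:
# Fit a multiple linear regression on the training split (feature set is an assumption)
features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
lr = LinearRegression()
lr.fit(x_train[features], y_train)

# Compare predictions on the training data with the actual training prices
yhat_train = lr.predict(x_train[features])
Title = 'Distribution Plot of Predicted Values Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)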
• So far the model seems to be doing well in learning from the training dataset. But
what happens when the model encounters new data from the testing dataset? When
the model generates new values from the test data, we see the distribution of the
predicted values is much different from the actual target values.
# Let's examine the distribution of the predicted values of the test data
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

plt.figure(figsize=(width, height))
# (the plotting call below is assumed; it mirrors the training-data plot above)
yhat_test = lr.predict(x_test[features])
Title = 'Distribution Plot of Predicted Values Using Test Data vs Test Data Distribution'
DistributionPlot(y_test, yhat_test, "Actual Values (Test)", "Predicted Values (Test)", Title)
plt.show()
plt.close()
print('the red line is for actual test values and the blue one is the predicted test values')
the red line is for actual test values and the blue one is the predicted test values
• Comparing Figure 1 and Figure 2, it is evident that the model fits the training data (Figure 1) much better than the test data (Figure 2). The difference in Figure 2 is most apparent in the 5,000 to 15,000 price range, where the distribution shapes are exceptionally different.
Overfitting
Overfitting occurs when the model fits the noise rather than the underlying process. When such a model is evaluated on the test set, it does not perform well, because it has modelled noise instead of the underlying process that generated the relationship.
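The cells that build the degree-5 features and fit the model are not shown; a minimal sketch, following the same pattern used for the 20th-order fit further below:
from sklearn.preprocessing import PolynomialFeatures

# Degree-5 polynomial features of highway-mpg for the train and test splits
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['highway-mpg']])
x_test_pr = pr.fit_transform(x_test[['highway-mpg']])

# Fit a linear model on the transformed features and predict on the test split
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat = poly.predict(x_test_pr)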
PolynomialFeatures(degree=5)
#Let's take the first five predicted values and compare it to the actual
target
print("Predicted values:", yhat[0:5])
print("True values:", y_test[0:5].values)
# Now try an 11th-order polynomial on the same feature
pr = PolynomialFeatures(degree=11)
x_train_pr = pr.fit_transform(x_train[['highway-mpg']])
x_test_pr = pr.fit_transform(x_test[['highway-mpg']])
pr
PolynomialFeatures(degree=11)
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat = poly.predict(x_test_pr)
yhat[0:5]
PollyPlot(x_train[['highway-mpg']], x_test[['highway-mpg']], y_train, y_test, poly, pr)
pr = PolynomialFeatures(degree=20)
x_train_pr = pr.fit_transform(x_train[['highway-mpg']])
x_test_pr = pr.fit_transform(x_test[['highway-mpg']])
pr
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
yhat = poly.predict(x_test_pr)
yhat[0:5]
PollyPlot(x_train[['highway-mpg']], x_test[['highway-mpg']], y_train, y_test, poly, pr)
• The 11th-order fit does quite well, but the 20th-order fit performs poorly around 40-50 mpg. This is an example of overfitting.
# R^2 of the training data
poly.score(x_train_pr, y_train)
0.49596883970465033
0.46760941866758243
• The lower the R^2, the worse the model; a negative R^2 is a sign of overfitting.
# Interactive interface
def f(order, test_data):
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train[['highway-mpg']])
    x_test_pr = pr.fit_transform(x_test[['highway-mpg']])
    poly = LinearRegression()
    poly.fit(x_train_pr, y_train)
    PollyPlot(x_train[['highway-mpg']], x_test[['highway-mpg']], y_train, y_test, poly, pr)

# The following interface allows you to experiment with different polynomial orders
# and different amounts of data
interact(f, order=(0, 25, 1), test_data=(0.05, 0.95, 0.05))
{"model_id":"36594b54c7ff4384b30aa4842fdf5d2f","version_major":2,"version_min
or":0}
# Let's see how the R^2 on the test data changes for different order polynomials and plot the results
Rsqu_test = []
order = [1, 2, 3, 4]          # candidate polynomial orders (the original list is not shown; assumed)
lr = LinearRegression()

for n in order:
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['highway-mpg']])
    x_test_pr = pr.fit_transform(x_test[['highway-mpg']])
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')
(180, 10)
• The predicted values are higher than the actual values for cars in the $10,000 price range; conversely, the predicted prices are lower than the actual prices in the $30,000 to $40,000 range. As such, the model is not as accurate in these ranges.
Ridge regression
Ridge regression is a way to create a parsimonious model when the number of predictor
variables in a set exceeds the number of observations, or when a data set has
multicollinearity (correlations between predictor variables).
• Ridge Regression vs. Least Squares
Least squares regression isn’t defined at all when the number of predictors exceeds the
number of observations; It doesn’t differentiate “important” from “less-important”
predictors in a model, so it includes all of them. This leads to overfitting a model and failure
to find unique solutions. Least squares also has issues dealing with multicollinearity in
data. Ridge regression avoids all of these problems. It works in part because it doesn’t
require unbiased estimators; While least squares produces unbiased estimates, variances
can be so large that they may be wholly inaccurate. Ridge regression adds just enough bias
to make the estimates reasonably reliable approximations to true population values.
• In this section, we will review Ridge regression and see how the parameter alpha changes the model. Just a note: our test data will be used as validation data here.
# Let's perform a degree-two polynomial transformation on our data
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[['curb-weight', 'engine-size', 'highway-mpg', 'normalized-losses', 'symboling']])
x_test_pr = pr.fit_transform(x_test[['curb-weight', 'engine-size', 'highway-mpg', 'normalized-losses', 'symboling']])
# Create a Ridge regression object, setting the regularization parameter (alpha) to 0.1
from sklearn.linear_model import Ridge
RigeModel = Ridge(alpha=0.1)

# Like regular regression, you can fit the model using the method fit
RigeModel.fit(x_train_pr, y_train)
C:\Users\User\anaconda3\lib\site-packages\sklearn\linear_model\_ridge.py:147:
LinAlgWarning: Ill-conditioned matrix (rcond=6.00252e-17): result may not be
accurate.
return linalg.solve(A, Xy, sym_pos=True,
Ridge(alpha=0.1)
# Obtain a prediction and compare the first five predicted samples to our test set
yhat = RigeModel.predict(x_test_pr)
print('predicted:', yhat[0:5])
print('test set :', y_test[0:5].values)
# We select the value of alpha that minimizes the test error; for example, we can use a for loop.
Rsqu_test = []
Rsqu_train = []
dummy1 = []
Alpha = 10 * np.array(range(0, 1000))

for alpha in Alpha:
    RigeModel = Ridge(alpha=alpha)
    RigeModel.fit(x_train_pr, y_train)
    Rsqu_test.append(RigeModel.score(x_test_pr, y_test))
    Rsqu_train.append(RigeModel.score(x_train_pr, y_train))
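The cell that plots R² against alpha is not shown; a minimal sketch consistent with the Legend object printed below:
width = 12
height = 10
plt.figure(figsize=(width, height))

plt.plot(Alpha, Rsqu_test, label='validation data')
plt.plot(Alpha, Rsqu_train, 'r', label='training data')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()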
<matplotlib.legend.Legend at 0x1e8cf9dd550>
0.5710640373206048
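The cell that builds the grid search is not shown; a minimal sketch consistent with the GridSearchCV repr below (the feature columns passed to fit are an assumption, based on the later practice cell):
from sklearn.model_selection import GridSearchCV

parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000]}]
RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)  # assumed feature set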
GridSearchCV(cv=4, estimator=Ridge(),
             param_grid=[{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000]}])
• The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:
BestRR=Grid1.best_estimator_
BestRR
Ridge(alpha=1000)
0.6430504126335327
• Practice : Perform a grid search for the alpha parameter and the normalization
parameter, then find the best values of the parameters
parameters2 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000], 'normalize': [True, False]}]
Grid2 = GridSearchCV(Ridge(), parameters2, cv=4)
Grid2.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Grid2.best_estimator_
Ridge(alpha=0.1, normalize=True)
Copyright®
GitHub: https://github.com/fin-mustakim