
Simple Linear Regression

According to the United Nations Environment Programme (UNEP) Sustainable Buildings and Climate Initiative, the construction trade contributes as much as 30% of all global greenhouse gas emissions and consumes up to 40% of all energy used worldwide. Climate change is currently having a powerful impact on how buildings are designed and constructed.
Predicting numeric outcomes with some measure of accuracy is an important facet of machine learning and data science. For this part, we will use a case study to understand linear regression and its close relatives. We will learn about the assumptions behind linear regression, multiple linear regression, partial least squares and penalization methods. We'll also focus on strategies for measuring regression performance and on implementation.
In this module, we will develop a multivariate multiple regression model to study
the effect of eight input variables on two output variables, which are the heating
load and the cooling load, of residential buildings. The data provided is from the
energy analysis data of 768 different building shapes. The features provided are the
relative compactness, surface area, wall area, roof area, overall height, orientation,
glazing area and glazing area distribution.
Data Source for content: UCI Machine Learning Repository: Energy efficiency
Data Set
Data Quiz: UCI Machine Learning Repository: Appliances energy prediction Data
Set
Simple Linear Regression
The simple linear regression model.
A simple linear regression model estimates the relationship between two quantitative variables, where one is referred to as the independent variable and the other the dependent variable. The independent variable (X), also called the predictor, is used to predict the dependent variable, which is referred to as the response variable (Y) (e.g. finding the relationship between the amount of CO2 gas emitted and the number of trees cut down). The value of Y can be obtained from X by finding the line of best fit (regression line) with minimum error for the data points on a scatter plot of both variables. A simple linear regression can be represented as:
y = θ₀x + θ₁

where
x is the independent variable,
θ₁ is the intercept,
θ₀ is the slope of the line of best fit,
θ₀ and θ₁ are known as the regression coefficients.
The UCI Machine Learning Repository: Energy efficiency Data Set is used in this module for a better understanding of the concepts. We select a sample of the dataset and use the relative compactness column as the predictor and the heating load column as the response variable.
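A minimal sketch of this fit with scikit-learn is shown below. The file name and column names are assumptions made for illustration; the UCI file labels the features and targets with codes such as X1 and Y1, so adjust the names to match your copy of the data.

```python
# Minimal sketch of simple linear regression with scikit-learn.
# Assumes the UCI energy efficiency data has been saved locally as
# "energy_efficiency.csv" with columns "relative_compactness" (predictor)
# and "heating_load" (response) -- both names are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("energy_efficiency.csv")        # hypothetical file name
X = df[["relative_compactness"]]                 # predictor must be 2-D for sklearn
y = df["heating_load"]                           # response variable

model = LinearRegression()
model.fit(X, y)

print("slope (theta_0):", model.coef_[0])        # slope of the line of best fit
print("intercept (theta_1):", model.intercept_)  # intercept
```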
Collinearity and Assumptions for Linear Regression
For better understanding, we explain the assumptions made by linear regression by
comparing results on our energy efficiency dataset and a dummy linear dataset
generated to have a similar shape (same number of rows and columns) as the energy
efficiency dataset. Some assumptions made by linear regression models about the
data are:
1. Linearity: the relationship between the variables is linear such that a straight
line is the line of best fit.

From the regression plots above, we can see that the residuals of the dummy data are spread evenly around the regression line, as they should be to meet the linearity assumption, unlike the residuals of the energy efficiency dataset, which lie farther from the regression line.

2. Homoscedasticity: the residuals or prediction errors are of equal or constant variance.
The variance of the residuals for the dummy dataset appears to be uniform, unlike that of the energy efficiency dataset, which violates this assumption.

3. Normality: the residuals follow a normal distribution.
The energy efficiency dataset flouts this assumption as the residuals are clearly not
normally distributed while the dummy dataset has normally distributed residuals
with the mean and median at 0.
4. Independence of the observations
In multiple linear regression where there are more predictors, it is assumed that
these variables are independent of each other without any strong correlation
between them.
The energy efficiency dataset shows strong correlations between relative compactness and surface area, relative compactness and overall height, and surface area and roof area, while the variables in the dummy dataset are seen to be independent of each other.
Overall, before inferences are drawn from a linear regression model, all the
assumptions discussed above must have been met.
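The checks described above can be reproduced with a short diagnostic script. This is a sketch only, assuming the same hypothetical file and column names used in the earlier snippet:

```python
# Sketch of basic checks for the linear regression assumptions.
# Assumes the hypothetical "energy_efficiency.csv" file and column names from above.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("energy_efficiency.csv")
X = df[["relative_compactness"]]
y = df["heating_load"]
model = LinearRegression().fit(X, y)

residuals = y - model.predict(X)

# 1. Linearity and 2. Homoscedasticity: residuals vs fitted values should show
#    no clear pattern and a roughly constant spread around zero.
plt.scatter(model.predict(X), residuals, s=5)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# 3. Normality: the residuals should be roughly bell-shaped around 0.
plt.hist(residuals, bins=30)
plt.xlabel("Residual")
plt.show()

# 4. Independence of the predictors: inspect pairwise correlations
#    (relevant once more than one predictor is used).
print(df.corr(numeric_only=True).round(2))
```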
Residual sum of squares and minimizing the cost function
A cost function is a measure of the performance of a model i.e. how far or close the
predicted values are to the real values. The objective is to minimise the cost
function in order for the model to continuously learn to obtain better results. In
linear regression, the cost function can be defined as the sum of squared errors in a
training set. The squares of the residuals are taken to penalise errors farther from
the line of best fit more than those closer to the line and obtain the best parameter
values.
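As a small worked example, the sum-of-squared-errors cost for the line y = θ₀x + θ₁ can be written as a short function. The data here is made up purely for illustration:

```python
# Sketch of the sum-of-squared-errors cost for the line y_hat = theta0 * x + theta1
# (notation follows the simple linear regression formula above).
import numpy as np

def cost(theta0, theta1, x, y):
    """Sum of squared residuals for the candidate line y_hat = theta0 * x + theta1."""
    y_hat = theta0 * x + theta1
    return np.sum((y - y_hat) ** 2)

# Tiny made-up dataset:
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(cost(2.0, 0.0, x, y))  # cost of the candidate line y_hat = 2x
```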
Gradient descent and coordinate descent algorithm
Gradient descent is an optimization algorithm that minimizes a cost function by moving in the direction that leads towards a local or global minimum. This is done by starting with random initial values and then iteratively updating them until the minimum cost is obtained. A learning rate is usually chosen to determine the step size taken at each iteration. It is important to select this parameter carefully: if the step is too small, it will take a long time to converge to the minimum cost, while if it is too large, the update can overshoot and surpass the location of the minimum.
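The sketch below implements this idea for simple linear regression using the squared-error cost. The learning rate and iteration count are illustrative choices, not tuned values:

```python
# Minimal gradient descent for simple linear regression (squared-error cost).
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iters=1000):
    theta0, theta1 = 0.0, 0.0                 # initial parameter values
    n = len(x)
    for _ in range(n_iters):
        y_hat = theta0 * x + theta1
        error = y_hat - y
        # Partial derivatives of the mean squared error w.r.t. each parameter
        grad_theta0 = (2 / n) * np.sum(error * x)
        grad_theta1 = (2 / n) * np.sum(error)
        # Step in the direction that reduces the cost
        theta0 -= lr * grad_theta0
        theta1 -= lr * grad_theta1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent(x, y))   # should approach a slope near 2 and intercept near 1
```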

Multiple Linear Regression


Unlike simple linear regression, multiple linear regression establishes
the relationship between the response variable and the predictors
(usually two or more). In reality, several factors contribute to a certain
outcome as opposed to just one as suggested by simple linear regression.
Multiple linear regression has similar assumptions as simple linear
regression and also assumes that there is no significant correlation
between the predictors. While the relationship between variables can be
linear, it allows for non-linear relationships that are not straight lines.
Y = θ₀ + θ₁X₁ + θ₂X₂ + . . . + θₙXₙ + ε
Collinearity
Correlation is a measure used to describe the linear relationship between two variables. Correlation values range from -1 for a perfect negative correlation (an increase in one variable is accompanied by a decrease in the other) to +1 for a perfect positive correlation (both variables increase or decrease together). A correlation value of 0 indicates that there is no linear correlation between the variables. A situation where two or more of the predictors have a strong correlation is known as multicollinearity. Since predictors are expected to be independent, when multicollinearity occurs, the correlated variables cannot independently contribute to predicting the value of the response variable. In addition, not all the predictors included are relevant to obtaining better results from the model. Adding more independent variables is not always better; it may only make the model more complicated. To resolve this, one of the correlated predictors is kept and the other removed from the data.
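A simple way to apply this in practice is to inspect the correlation matrix of the predictors and drop one variable from each highly correlated pair. The sketch below assumes the same hypothetical file and column names as before:

```python
# Sketch: detecting multicollinearity with a correlation matrix and dropping
# one variable from a highly correlated pair. File and column names are
# assumptions based on the feature list given earlier in this module.
import pandas as pd

df = pd.read_csv("energy_efficiency.csv")          # hypothetical file name
predictors = ["relative_compactness", "surface_area",
              "wall_area", "roof_area", "overall_height"]
print(df[predictors].corr().round(2))

# If, say, relative_compactness and surface_area turn out to be strongly
# correlated (|r| close to 1), keep one of them and remove the other.
reduced = df.drop(columns=["surface_area"])
```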
Polynomial Regression
A polynomial regression model is considered a linear regression model
that can be used when a curvilinear relationship exists between the
predictors and the response variable. For a single independent variable it can be represented as Y = θ₀ + θ₁X + θ₂X² + . . . + θₙXⁿ + ε, where n is the degree of the polynomial and Y is a linear function of θ. Depending
on the task and data, there might be multiple predictors in a polynomial
regression model which results in more interactions in the model. As
expected, the complexity in the model increases as the degree increases.
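A common way to fit such a model is to expand the predictor into polynomial terms and then fit an ordinary linear regression on the expanded features. The sketch below uses synthetic data purely for illustration:

```python
# Sketch of polynomial regression: expand a single predictor into polynomial
# features, then fit an ordinary linear regression on the expanded design matrix.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50).reshape(-1, 1)
y = 1.5 + 0.8 * x.ravel() - 0.3 * x.ravel() ** 2 + rng.normal(0, 0.1, 50)

degree = 2                                   # degree n of the polynomial
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(x, y)
print(poly_model.named_steps["linearregression"].coef_)   # fitted coefficients
```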
- Coefficients of multiple linear regression
- General notations
Measuring Regression Performance
Evaluation Metrics for performance (RSS, R-Squared,
RMSE, MAE etc)
How well a regression model performs can be judged by how close the predicted values are to the ground truth. It is very important to use the appropriate metric to evaluate the performance. In this section, we discuss some examples of metrics used in evaluating regression models, such as RSS, R-Squared, RMSE and MAE.
Mean Absolute Error (MAE)
MAE is easy and intuitive: it is the average of the absolute errors between the predicted values and the true values. Because the absolute difference is taken, this metric ignores direction, so it cannot tell whether the model is overshooting or undershooting. The smaller the MAE, the better the model; an MAE of 0 would mean the model predicts perfectly, which is almost impossible in practice. Compared with squared-error metrics, the mean absolute error is more robust to outliers.
Residual Sum of Squares (RSS)
Also known as the sum of squared residuals (SSR), this metric captures the variance left unexplained by the model; it measures how well the model approximates the data. A residual is the estimated error made by a model: in simpler terms, it is the difference between the nth true value and the nth value predicted by the model. RSS is the sum of the squares of these residuals. The lower the RSS, the better the model's estimations, and vice versa.
Root Mean Square Error (RMSE)
This is closely related to the mean square error (MSE): it is the square root of the MSE and estimates the standard deviation of the residuals. It describes the spread of the residuals around the line of best fit and the noise in the model. When the RMSE is low, the errors made by the model deviate only slightly from the true values. It is calculated by summing the squares of the residuals, dividing by the number of observations, and taking the square root.
R-Squared
Also known as the coefficient of determination, R-squared is a metric used in regression to determine the goodness of fit of the model. With values typically ranging from 0 to 1, it gives the proportion of the variance in the response variable that is explained by the model. In most cases, the higher the value, the better the model; however, this is not always true.
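The metrics above can be computed in a few lines. The sketch below uses arbitrary made-up true and predicted values just to show the calculations:

```python
# Sketch computing the metrics discussed above (MAE, RSS, RMSE, R-squared).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rss = np.sum((y_true - y_pred) ** 2)                 # residual sum of squares
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # square root of the MSE
r2 = r2_score(y_true, y_pred)                        # coefficient of determination

print(f"MAE={mae:.3f}  RSS={rss:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```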
Model complexity, Underfitting and Overfitting
Model complexity refers to the number of input features used to train a model and
the algorithmic learning complexity. An overly complex model can be difficult to interpret, is prone to overfitting and also requires more computing. When creating
models, it is imperative for the model to generalise well enough to make
reasonable predictions on new and unseen data. An overfit model will perform well
on the training data and poorly on unseen data. While a model is required to learn
the actual relationship of the variables in the training set, an overfit model
memorises the training set, fits the noise, outliers and irrelevant information, then
makes incorrect predictions based on this noise. On the other hand, a model that is too simple underfits, often because it has too few features to capture the details and relationships in the data. In a later section,
we will discuss methods that can be used to achieve optimal and acceptable model
complexities while avoiding overfitting and underfitting.
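One way to see this behaviour directly is to compare training and test scores as model complexity grows. The sketch below varies the polynomial degree on a small synthetic dataset; the data and degrees are illustrative choices only:

```python
# Sketch illustrating under- and overfitting: compare training and test R-squared
# as model complexity (polynomial degree) increases on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * x.ravel()) + rng.normal(0, 0.2, 60)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 3, 12):        # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          round(model.score(x_tr, y_tr), 3),   # training R-squared
          round(model.score(x_te, y_te), 3))   # test R-squared
```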
The Bias-Variance tradeoff
Bias and variance are two common sources of error in machine learning, and there is a constant struggle to achieve both low bias and low variance. Bias is a measure of the correctness of a model, i.e. how far off the model is from being correct. High bias increases the error by making assumptions that prevent the model from capturing relevant relationships between the predictors and the response variable, while low bias gives lower error and helps prevent underfitting by capturing the important relationships. Variance, on the other hand, tells how much the values estimated by a model vary across different training data. When the variance is low, the model's estimates change only slightly when it is trained on new data. High variance causes overfitting: the estimates change a great deal with new training data because the model is so complex that it has learnt the patterns of one particular training set and cannot generalise to others. While it is essential to obtain low bias and low variance, it is almost impossible to achieve both simultaneously, which is where the 'bias-variance tradeoff' arises.

Penalization Methods
Regulating over- and under-fitting
Regularization is a method used to make complex models simpler by penalising coefficients to reduce their magnitude and variance, and in turn reduce overfitting in the model. Regularization works by shrinking the coefficients in the model towards zero, so that the complexity term added to the loss results in a bigger penalty for models with higher complexity. Two common regularised regression techniques are Ridge and Lasso regression.
Ridge Regression
Also known as L2 Regularisation, this is a technique that uses a penalty term to
shrink the magnitude of coefficients towards zero without eliminating them. The
shrinkage prevents overfitting caused by the complexity of the model or
collinearity. It adds the squared magnitudes of the coefficients to the loss function as the penalty term. If the error is defined as the sum of squared residuals, then when an L2 regularization term is added, the result is the equation below.
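In its standard form, the ridge loss adds the squared coefficients, scaled by the regularization parameter λ (lambda), to the sum of squared residuals:

$$\text{Loss}_{ridge} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{p}\theta_j^{2}$$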

As lambda increases, the penalty increases, causing the coefficients to shrink further towards zero; in the same vein, if lambda is zero, the penalty term vanishes and only the ordinary loss function remains.
Feature Selection, The LASSO Regression and Elastic Net
Feature Selection and Lasso Regression
Some datasets can be high dimensional with a very high number of features and
some of them not contributing towards predicting the response variable. As a
result, it becomes more computationally expensive to train a model and can also
introduce noise causing the model to perform poorly. The process of selecting
significant features that contribute the most in obtaining high performing models is
known as feature selection. Lasso regression (Least Absolute Shrinkage and Selection Operator) reduces overfitting by penalising the coefficients so that some of them are shrunk all the way to zero; in doing so it indirectly performs feature selection, keeping only the relevant variables that minimize prediction errors. It uses L1 regularisation, which adds the absolute values of the coefficients to the loss function as the penalty term. The application of L1 regularisation (Lasso regression) results in simpler, sparser models that allow for better interpretation. Although lasso regression helps prevent overfitting, one major limitation is that it does not consider other factors when eliminating predictors. For example, it arbitrarily eliminates one variable from a correlated pair, which might not be a good rationale from a human perspective. When an L1 regularization term is added, the result is the equation below.
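In its standard form, the lasso loss adds the absolute values of the coefficients, scaled by λ, to the sum of squared residuals:

$$\text{Loss}_{lasso} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{p}\left|\theta_j\right|$$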
Elastic Net Regression
This is simply a combination of the L1 and L2 penalties from ridge and lasso
regression. This method arose from the need to overcome the limitations of lasso
regression. It regularizes and performs feature selection simultaneously by first finding the optimal values of the coefficients as in ridge regression and then performing a shrinkage.
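The three penalised regressors discussed above are available directly in scikit-learn. The sketch below uses synthetic data, and the alpha (lambda) values are illustrative rather than tuned:

```python
# Sketch of the penalised regressors discussed above, using scikit-learn.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))                     # 8 predictors, as in the energy data
true_coef = np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 2.0, 0.0])
y = X @ true_coef + rng.normal(0, 0.5, 100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients, keeps all
lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives some coefficients to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties

print("ridge:      ", np.round(ridge.coef_, 2))
print("lasso:      ", np.round(lasso.coef_, 2))
print("elastic net:", np.round(enet.coef_, 2))
```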

Non-Linear Regression Methods and Other Recommendations
Model Tuning and choosing parameters
Machine learning models are parameterized such that there has to be a search for
the combination of parameters that will result in the optimal performance of the
model. The parameters that define the model architecture are referred to as
hyperparameters while the process of exploring a range of values is called
hyperparameter tuning. It is important to note the distinction between model
parameters and hyperparameters. Unlike hyperparameters, model parameters are
learnt during the training phase while setting hyperparameters is exclusive of the
training process. Ideally, when hyperparameter tuning is completed, the result is
the best parameters for the model. Grid search and random search are two common
strategies for tuning hyperparameters.
Grid Search
Grid search exhaustively explores a grid of parameter values: for every combination of parameters, a model is built and evaluated, and the model with the best result, together with its corresponding parameters, is selected. While it is computationally expensive, setting up a grid search is quite easy.
Random Search
As opposed to grid search, random search randomly samples parameter values from the grid to build and evaluate models. It does not sequentially try every combination as grid search does; instead, it allows for a quick exploration of the entire search space to reach good values.
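Both strategies are available in scikit-learn. The sketch below tunes the regularization strength of a Ridge model on synthetic data; the parameter ranges and iteration counts are illustrative choices:

```python
# Sketch of grid search and random search over a Ridge alpha parameter.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 2.0, 0.0]) + rng.normal(0, 0.5, 100)

# Grid search: tries every value in the grid.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X, y)
print("grid search best alpha:", grid.best_params_)

# Random search: samples a fixed number of candidates from the range.
param_dist = {"alpha": np.logspace(-2, 2, 100)}
rand = RandomizedSearchCV(Ridge(), param_dist, n_iter=10, cv=5, random_state=42)
rand.fit(X, y)
print("random search best alpha:", rand.best_params_)
```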
Data splitting, resampling and cross validation strategy
Data splitting in data science involves setting aside a portion of the dataset for
testing (out of sample or hold-out) and evaluating the performance of the model to
provide unbiased results, while the rest is used to fit the model. The proportion of the split is largely a matter of choice and, sometimes, of the size of the dataset. However, a common practice is to split the dataset into training, validation (or dev) and testing sets, where the validation set is used to tune the hyperparameters and select the best values for the model. Resampling involves repeatedly drawing samples from the original dataset and using these samples to obtain more information about the model; it can create different samples for training and others for evaluation. Cross validation is a resampling method used to assess how well a model generalises and to help prevent overfitting.
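The sketch below combines a hold-out split with k-fold cross validation on the training portion, using synthetic data as a stand-in for a real dataset:

```python
# Sketch of a train/test split plus 5-fold cross validation on the training set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, -2.0, 0.0, 3.0]) + rng.normal(0, 0.3, 200)

# Hold out a test set, then cross-validate on the remaining training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")
print("cross-validated R2 per fold:", scores.round(3))
print("mean R2:", scores.mean().round(3))
```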
Dataset Description
The dataset for the remainder of this quiz is the Appliances Energy Prediction data. The data set covers about 4.5 months at 10-minute intervals. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around every 3.3 minutes. The wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and to filter out non-predictive attributes (parameters). The attribute information can be seen below.
Attribute Information:
Date, time year-month-day hour:minute:second
Appliances, energy use in Wh
lights, energy use of light fixtures in the house in Wh
T1, Temperature in kitchen area, in Celsius
RH_1, Humidity in kitchen area, in %
T2, Temperature in living room area, in Celsius
RH_2, Humidity in living room area, in %
T3, Temperature in laundry room area
RH_3, Humidity in laundry room area, in %
T4, Temperature in office room, in Celsius
RH_4, Humidity in office room, in %
T5, Temperature in bathroom, in Celsius
RH_5, Humidity in bathroom, in %
T6, Temperature outside the building (north side), in Celsius
RH_6, Humidity outside the building (north side), in %
T7, Temperature in ironing room , in Celsius
RH_7, Humidity in ironing room, in %
T8, Temperature in teenager room 2, in Celsius
RH_8, Humidity in teenager room 2, in %
T9, Temperature in parents room, in Celsius
RH_9, Humidity in parents room, in %
To, Temperature outside (from Chievres weather station), in Celsius
Pressure (from Chievres weather station), in mm Hg
RH_out, Humidity outside (from Chievres weather station), in %
Wind speed (from Chievres weather station), in m/s
Visibility (from Chievres weather station), in km
Tdewpoint (from Chievres weather station), in °C
rv1, Random variable 1, nondimensional
rv2, Random variable 2, nondimensional
To answer some questions, you will need to normalize the dataset using the
MinMaxScaler after removing the following columns: [“date”, “lights”]. The
target variable is “Appliances”. Use a 70-30 train-test set split with a random
state of 42 (for reproducibility). Run a multiple linear regression using the
training set and evaluate your model on the test set.
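A minimal sketch of this workflow is shown below. It assumes the Appliances Energy Prediction data has been saved locally as "energydata_complete.csv" (the file name is an assumption; adjust it to your download):

```python
# Sketch of the quiz workflow: drop columns, normalise, split 70-30, fit and evaluate.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("energydata_complete.csv")        # hypothetical file name
df = df.drop(columns=["date", "lights"])           # columns removed as instructed

# Normalise all columns to the [0, 1] range with MinMaxScaler.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
X = scaled.drop(columns=["Appliances"])            # predictors
y = scaled["Appliances"]                           # target variable

# 70-30 train-test split with random_state=42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MAE :", round(mean_absolute_error(y_test, y_pred), 3))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, y_pred)), 3))
print("R2  :", round(r2_score(y_test, y_pred), 3))
```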
