UNIT I Notes-1
The family of linear models is so named because the function that specifies the relationship
between the X, the predictors, and the y, the target, is a linear combination of the X values.
A linear model is simply a smarter form of a summation.
The predictors should tell us something; they should give us some hint about the answer variable;
otherwise any machine learning algorithm won't work properly. We can predict our response
because the information about the answer is already somewhere inside the features, maybe
scattered, twisted, or transformed, but it is just there. Machine learning just gathers and
reconstructs such information.
In statistics, the linear model family is called the generalized linear model (GLM).
A linear model using just a single predictor variable is called simple linear regression.
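To make the idea of a weighted summation concrete, here is a minimal sketch (the numbers are made up for illustration and are not taken from any dataset used in these notes):
import numpy as np
x = np.array([2.0, 5.0, 1.0])      # values of three predictors for one observation
w = np.array([0.5, 1.2, -0.3])     # one weight (coefficient) per predictor
bias = 4.0                         # the constant term
prediction = np.sum(w * x) + bias  # the weighted summation: 0.5*2 + 1.2*5 - 0.3*1 + 4
print(prediction)                  # 10.7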
The simplest possible predictor is the statistical mean. In fact, you can simply guess by always using the same constant
number, and the mean serves such a role very well because it is a powerful descriptive number for data
summary. The mean works very well with normally distributed data but often it is quite suitable even for
different distributions. A normally distributed curve is a distribution of data that is symmetric and has
certain characteristics regarding its shape.
The key to understanding if a distribution is normal is the probability density function (PDF), a function
describing the probability of values in the distribution. In the case of a normal distribution, the PDF is as
follows:

f(x) = (1 / (σ √(2π))) · exp( −(x − µ)² / (2σ²) )

In such a formulation, the symbol µ represents the mean (which coincides with the median and the
mode) and the symbol σ is the standard deviation (its square, σ², is the variance). Based on different means and standard deviations, we can calculate
different value distributions, as the following code demonstrates and visualizes:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean, standard_deviation = 0, 1    # parameters of the normal distribution
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, mean, standard_deviation))
plt.show()
mean_expected_value = dataset['target'].mean()
or
np.mean(dataset['target'])
Statistics suggests that, to measure the difference between the prediction and the real value, we should
square the differences and then sum them all. This is called the sum of squared errors (SSE):
Squared_errors = (dataset['target'] - mean_expected_value)**2
SSE = np.sum(Squared_errors)
density_plot = Squared_errors.plot(kind='hist')
plt.show()
The plot shows how frequent certain errors are with respect to their values. You will immediately
notice that most errors are around zero (there is a high density around that value). This can
be considered a good situation, since in most cases the mean is a good approximation, but some errors
are very far from zero and can attain considerable values.
In the real estate business, we actually know that usually the larger a house is, the more expensive it is;
however, this rule is just part of the story and the price is affected by many other considerations. For the
moment, we will keep it simple and just assume that an extension to a house is a factor that positively
affects the price, and consequently, more space equals more costs when building the house (more land,
more construction materials, more work, and consequently a higher price). Now, we have a variable that
we know should change with our target and we just need to measure it and extend our initial formula
based on constant values with something else. In statistics, there is a measure that captures how
(in the sense of how much and in what direction) two variables relate to each other: correlation.
Computing a correlation involves a few steps.
First, your variables have to be standardized (or your result won't be a correlation but a covariance, a
measure of association that is affected by the scale of the variables you are working with).
In statistical Z score standardization, you subtract from each variable its mean and then you divide the
result by the standard deviation. The resulting transformed variable will have a mean of 0 and a standard
deviation of 1 (or unit variance, since variance is the squared standard deviation). The formula for
standardizing a variable is as follows:

z = (x − µ) / σ
This can be achieved in Python using a simple function:
def standardize(x):
return (x-np.mean(x))/np.std(x)
After standardizing, you compare, for each observation, the difference of each variable from its own mean. If
the two differences agree in sign, their multiplication will be positive (evidence that they have the
same directionality); however, if they differ, the multiplication will turn negative. By summing all the
multiplications between the differences, and dividing them by the number of observations, you
will finally get the correlation, which will be a number ranging from -1 to 1.
The absolute value of the correlation will provide you with the intensity of the relation between
the two variables compared, 1 being a sign of a perfect match and zero a sign of no linear
relation between them. The sign instead will hint at the
proportionality; positive is direct (when one grows the other does the same), negative is indirect (when
one grows, the other shrinks).
observations = float(len(dataset['RM']))
def standardize(variable):
    return (variable - np.mean(variable)) / np.std(variable)
correlation = np.sum(standardize(dataset['RM']) * standardize(dataset['target'])) / observations
Our correlation estimation for the relation between the value of the target variable and the
average number of rooms in houses in the area is 0.695, which is positive and remarkably strong, since
the maximum positive score of a correlation is 1.0.
Let's graph what happens when we correlate two variables. Using a scatterplot, we can easily
visualize the two involved variables. A scatterplot is a graph where the values of two variables are treated
as Cartesian coordinates; thus, for every (x, y) value a point is represented in the graph:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatter_plot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
meanY = scatter_plot.plot(x_range, [dataset['target'].mean()]*2, '--', color='red', linewidth=1)
meanX = scatter_plot.plot([dataset['RM'].mean()]*2, y_range, '--', color='red', linewidth=1)
plt.show()
A linear regression is a line: in bi-dimensional space (x, y), it takes the form of the classical formula of
a line in a Cartesian plane, y = mx + q, where m is the angular coefficient (expressing the slope of the line
with respect to the x axis) and q is the intercept, the point where the line crosses the y axis. Formally, machine learning
indicates the corresponding expression for a linear regression as follows:

y = βX + bias

Here, β plays the role of m (the coefficient of the predictor X) and the bias plays the role of q (the constant term).
Statsmodels offers two implementations of this model:
• statsmodels.api: This works with distinct predictor and answer variables and requires you to apply any
transformation of the variables to the predictor matrix yourself, including adding the intercept
• statsmodels.formula.api: This works in a similar way to R, allowing you to specify a functional form (the
formula of the summation of the predictors)
As a first step, let's import both Statsmodels modules, naming them as conventionally indicated in
the package documentation:
import statsmodels.api as sm
import statsmodels.formula.api as smf
y = dataset['target']
X = dataset['RM']
X = sm.add_constant(X)
The X variable needs to be extended by a constant term (a column of ones); the bias will be calculated accordingly. In fact,
as you remember, with this extension the formula of a linear regression can be written compactly as follows:

y = Xβ
This can be interpreted as a combination of the variables in X, multiplied by its corresponding β value.
Consequently, the predictor X now contains both the predictive variable and a unit constant. Also, β is no
longer a single coefficient, but a vector of coefficients. Let's have a visual confirmation of this by requiring
the first values of the Pandas DataFrame using the head method:
X.head()
At this point, we just need to set the initialization of the linear regression calculation:
linear_regression = sm.OLS(y,X)
Also, we need to ask for the estimation of the regression coefficients, the β vector:
fitted_model = linear_regression.fit()
Alternatively, the same model can be specified and fitted using the formula API:
linear_regression = smf.ols(formula='target ~ RM', data=dataset)
fitted_model = linear_regression.fit()
The previous two code lines comprise both steps seen before, without
requiring any particular variable preparation, since the bias is automatically incorporated. In fact,
the specification about how the linear regression should work is incorporated into the string
target ~ RM, where the variable name left of the tilde (~) indicates the answer variable, the
variable name (or names, in the case of a multiple regression analysis) on the right being for the
predictor. Actually, smf.ols expects quite a different input compared to sm.OLS, because it can
accept our entire original dataset (it selects what variables are to be used by using the
provided formula), whereas sm.OLS expects a matrix containing just the features to be used
for prediction.
A summary (a method of the fitted model) can quickly tell you everything that you need to know
about the regression analysis. In case you have tried statsmodels.formula.api, we also re-initialize
the linear regression using statsmodels.api, since the two APIs do not work on the same X and our
following code relies on the sm.OLS specification:
linear_regression = sm.OLS(y,X)
fitted_model = linear_regression.fit()
fitted_model.summary()
We first need to extract two elements from the fitted model: the coefficients and the predictions
calculated on the data on which we built the model.
print (fitted_model.params)
betas = np.array(fitted_model.params)
fitted_values = fitted_model.predict(X)
• Dep. Variable: It just reminds you what the target variable was
• Model: Another reminder of the model that you have fitted, the OLS is ordinary least squares,
another way to refer to linear regression
• Method: The parameters fitting method (in this case least squares, the classical computation
method)
• DF Residuals: The degrees of freedom of the residuals, which is the number of observations
minus the number of parameters
• DF Model: The number of estimated parameters in the model (excluding the constant term
from the count)
The second table gives a more interesting picture, focusing on how good
the fit of the linear regression model is and pointing out any possible
problems with the model:
• R-squared: This is the coefficient of determination, a measure of how well the regression
does with respect to a simple mean.
• Adj. R-squared: This is the coefficient of determination adjusted for the number of
parameters in the model and the number of observations that helped build it.
• F-statistic: This is a measure telling you whether, from a statistical point of view, all your
coefficients, apart from the bias and taken together, are different from zero. In simple words, it
tells you whether your regression is really better than a simple average.
• Prob (F-statistic): This is the probability that you got that F-statistic just by chance, given
the observations that you have used (such a probability is actually called the p-value of the F-
statistic). If it is low enough, you can be confident that your regression is really better than a
simple mean. Usually, in statistics and science, a test probability has to be equal to or lower than
0.05 (a conventional criterion of statistical significance) to grant such confidence.
• AIC: This is the Akaike Information Criterion, a score that evaluates the model based on
the number of observations and the complexity of the model itself. The lower the AIC score,
the better. It is very useful for comparing different models and for statistical variable
selection.
• BIC: This is the Bayesian Information Criterion. It works like AIC, but it imposes a higher
penalty on models with more parameters.
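These figures are not only printed in the summary; the fitted statsmodels results object also exposes them as attributes, as this short sketch shows:
print(fitted_model.rsquared, fitted_model.rsquared_adj)  # R-squared and adjusted R-squared
print(fitted_model.fvalue, fitted_model.f_pvalue)        # F-statistic and its p-value
print(fitted_model.aic, fitted_model.bic)                # information criteria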
R-squared is much more interesting because it tells you how much better your regression model is in
comparison to a single mean. It does so by reporting the proportion of the variance that a mean used as
a predictor leaves unexplained but that your model was actually able to explain.
from scipy.stats import pearsonr
# in simple linear regression, R-squared equals the squared correlation between predictor and target
(pearsonr(dataset['RM'], dataset['target'])[0])**2
Out: 0.4835254559913339
R-squared is perfectly aligned with the squared errors that the linear regression is trying to minimize;
thus, a better R-squared means a better model.
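To make the link between R-squared and the squared errors explicit, here is a minimal sketch that recomputes R-squared as one minus the ratio between the model's sum of squared errors and the sum of squared errors of the simple mean; the result should match fitted_model.rsquared:
SSE_model = np.sum((dataset['target'] - fitted_values)**2)            # errors of the regression
SSE_mean = np.sum((dataset['target'] - dataset['target'].mean())**2)  # errors of the simple mean
print(1 - SSE_model / SSE_mean)   # manually computed R-squared
print(fitted_model.rsquared)      # the value reported by Statsmodels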
The third table of the summary reports the estimated coefficients, each presented together with a
series of tests. These tests can make us confident that we have not been fooled by a few
anomalous observations:
• std err: The standard error of the estimate of the coefficient; the larger it is, the more
uncertain the estimation of the coefficient
• t: The t-statistic value, a measure indicating whether the coefficient's true value is different from
zero
• P > |t|: The p-value, the probability of obtaining a coefficient at least this far from zero just by
chance if its true value were zero
• [95.0% Conf. Interval]: The lower and upper bounds of the coefficient's 95% confidence interval,
accounting for the variability you would observe with different samples and thus different estimated coefficients.
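All these per-coefficient figures can also be extracted programmatically from the fitted model; a short sketch:
print(fitted_model.params)      # estimated coefficients (bias and RM)
print(fitted_model.bse)         # standard errors of the coefficients
print(fitted_model.tvalues)     # t-statistics
print(fitted_model.pvalues)     # p-values
print(fitted_model.conf_int())  # 95% confidence intervals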
The coefficients are the most important output that we can obtain from our regression model because
they allow us to re-create the weighted summation that can predict our outcomes. In our example, our
coefficients are −34.6706 for the bias (also called the intercept, recalling the formula for a line in a
Cartesian space) and 9.1021 for the RM variable. Recalling our formula, we can plug in the numbers we
obtained:

y = −34.6706 + 9.1021 * RM

Now, replacing RM with a value of interest, say 4.55 average rooms, the computation becomes the following:
9.1021*4.55-34.6706
Out: 6.743955
A linear regression can always work within the range of values it learned from (this is called interpolation)
but can provide correct values beyond its learning boundaries (a different predictive activity called
extrapolation) only under certain conditions.
The residuals are the difference between the target values and the predicted (fitted) values. The last part of the
summary reports a few statistics describing their distribution:
• Skewness: This is a measure of the symmetry of the residuals around the mean. For symmetric
distributed residuals, the value should be around zero. A positive value indicates a long tail to
the right; a negative value a long tail to the left.
• Kurtosis: This is a measure of the shape of the distribution of the residuals. A bell-shaped
distribution has a zero measure. A negative value points to a too flat distribution; a positive
one has too great a peak.
• Omnibus D'Agostino's test: This is a combined statistical test for skewness and kurtosis.
• Durbin-Watson: This is a test for the presence of correlation among the residuals (relevant
during analysis of time-based data).
When inspecting the residuals, it is important to keep an eye out for any of these three problems showing up (a plotting sketch follows the list):
1. Values too far from the average. Large standardized residuals hint at a serious difficulty when modeling
such observations. Also, in the process of learning these values, the regression coefficients may have
been distorted.
2. Different variance with respect to the value of the predictor. If the linear regression is an average
conditioned on the predictor, non-homogeneous variance points out that the regression is not working
properly when the predictor takes certain values.
3. Strange shapes in the cloud of residual points may indicate that you need a more complex model for
the data you are analyzing.
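A practical way to screen for all three problems is to plot the standardized residuals against the predictor and look for extreme points, funnel shapes, or curved patterns. The following is a minimal sketch, reusing the residuals stored in the fitted Statsmodels model:
residuals = fitted_model.resid                                    # target values minus fitted values
std_residuals = (residuals - residuals.mean()) / residuals.std()  # standardized residuals
plt.scatter(dataset['RM'], std_residuals)
plt.axhline(y=0, color='red')   # residuals should scatter evenly around zero
plt.xlabel('RM')
plt.ylabel('standardized residuals')
plt.show()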
There are six different reasons why a predictor may relate to your target, and a cautionary word about each
will help you handle such predictors without difficulty:
• Direct causation: x causes y; for instance, in the real estate business the value is directly
proportional to the size of the house in square meters.
• Reciprocal effects: x causes y but it is also influenced by y. This is quite typical of many macro-
economic dynamics where the effect of a policy augments or diminishes its effects. As an
example in real estate, high crime rates in an area can lower its prices but lower prices mean
that the area could quickly become even more degraded and dangerous.
• Spurious causation: This happens when the real cause is actually z, which causes both x and y;
consequently it is just a fallacious illusion that x implies y because it is z behind the scenes. For
instance, the presence of expensive art shops and galleries may seem to correlate with house
prices; in reality, both are determined by the presence of affluent residents.
• Indirect causation: x in reality is not causing y but it is causing something else, which then
causes y. A good municipality investing in infrastructures after higher taxes can indirectly affect
house prices because the area becomes more comfortable to live in, thus attracting more
demand. Higher taxes, and thus more investments, indirectly affect house prices.
• Conditional effect: x causes y in respect of the values of another variable z; for instance, when
z has certain values x is not influencing y but, when z takes particular values, the x starts
impacting y. We also call this situation interaction. For instance the presence of schools in an
area can become an attractor when the crime rate is low, so it affects house prices only when
there is little criminality.
• Random effect: Any recorded correlation between x and y has been due to a lucky sampling
selection; in reality there is no relationship with y at all.
The ideal case is when you have a direct causation; then, you will have a predictor in your model that will
always provide you with the best values to derive your responses. In the other cases, it is likely that the
imperfect cause-effect relationship with the target variable will lead to more noisy estimates, especially in
production when you will have to work with data not seen before by the model. Reciprocal effects are
more typical of econometric models. They require special types of regression analysis. Including them in
your regression analysis may improve your model; however, their role may be underestimated. Spurious
and indirect causes will add some noise to your x and y relationship; this could bring noisier estimates
(larger standard errors). Often, the solution is to get more observations for your analysis. Conditional
effects, if not caught, can limit your model's ability to produce accurate estimates. If you are not aware of
any of them, given your domain knowledge of the problem, it is a good step to check for any of them
using some automatic procedure to test possible interactions between the variables. Random effects are
the worst possible thing that could happen to your model, since there is no real relationship to learn and any
apparent predictive power will vanish on new data.
To obtain a prediction from the fitted model, we prepare an observation vector (with the unit constant in first position, as in X) and call the predict method:
RM = 5  # an illustrative value for the average number of rooms
Xp = np.array([1, RM])
print("Our model predicts if RM = %0.1f the answer value is %0.1f" % (RM, fitted_model.predict(Xp)[0]))
A nice usage of the predict method is to project the fitted predictions on our previous scatterplot to allow
us to visualize the price dynamics in respect of our predictor, the average number of rooms:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatter_plot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
regression_line = scatter_plot.plot(dataset['RM'], fitted_values, '-', color='red', linewidth=1)
plt.show()
Using the previous code snippet, we obtain the preceding graphical representation of the regression
line, together with an indication of how such a line crosses the cloud of data points. For instance,
thanks to this graphical display, we can notice that the regression line passes exactly through the point
where the x and y averages meet.
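As a quick numerical check of this last remark (a minimal sketch reusing the fitted model), predicting at the mean of RM should return the mean of the target, up to floating-point precision:
print(fitted_model.predict(np.array([1, dataset['RM'].mean()]))[0])  # prediction at the average RM
print(dataset['target'].mean())                                      # average of the target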
Besides the predict method, generating the predictions is quite easy by just using the dot function in
NumPy. After preparing an X matrix containing both the variable data and the bias (a column of ones) and
the coefficient vectors, all you have to do is to multiply the matrix by the vector. The result will itself be a
vector of length equal to the number of observations:
predictions_by_dot_product = np.dot(X,betas)
from sklearn import linear_model  # note: the normalize argument has been removed in recent scikit-learn releases
linear_regression = linear_model.LinearRegression(fit_intercept=True)
Data preparation, instead, requires counting the observations and carefully preparing the predictor array
to specify its two dimensions (if left as a vector, the fitting procedure will raise an error):
observations = len(dataset)
X = dataset['RM'].values.reshape((observations,1))
After completing all the previous steps, we can fit the model using the fit method:
linear_regression.fit(X,y)
A very convenient feature of the Scikit-learn package is that all the models, no matter their type of
complexity, share the same methods. The fit method is always used for fitting and it expects an X and a y
(when the model is a supervised one). Instead, the two common methods for making an exact prediction
(always for regression) and its probability (when the model is probabilistic) are predict and predict_proba,
respectively.
After fitting the model, we can inspect the vector of the coefficients and the bias constant:
print (linear_regression.coef_)
print (linear_regression.intercept_)
Using the predict method and slicing the first 10 elements of the resulting list, we output the first 10
predicted values:
print (linear_regression.predict(X)[:10])
If we prepare a new matrix and we add a constant, we can calculate the results by ourselves using a
simple matrix–vector multiplication:
Xp = np.column_stack((X, np.ones(observations)))
v_coef = list(linear_regression.coef_) + [linear_regression.intercept_]
As expected, the result of the product provides us with the same estimates as the predict method:
np.dot(Xp,v_coef)[:10]
The following is a simple test: generate a large dataset and check the performance of the two versions of
linear regression on it.
After generating ten million observations of a single variable (held in HX and Hy), we measure each fit using the %%time magic
function for IPython. This magic function automatically computes how long it takes to complete the
calculations in the IPython cell:
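The generation step is not shown here; a possible sketch, assuming scikit-learn's make_regression helper is used, is the following:
from sklearn.datasets import make_regression
# ten million observations of a single predictor variable and one target
HX, Hy = make_regression(n_samples=10000000, n_features=1, n_targets=1, random_state=101)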
%%time
linear_regression = linear_model.LinearRegression(fit_intercept=True)  # Scikit-learn version
linear_regression.fit(HX, Hy)

%%time
sm_linear_regression = sm.OLS(Hy, sm.add_constant(HX))  # Statsmodels version
sm_linear_regression.fit()
There are quite a few methods to minimize the cost function (the sum of squared errors), some performing better than others in the presence of
large quantities of data. Among the better performers, the most important ones are the pseudoinverse, QR
factorization, and gradient descent.
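As an illustration of the first approach (a sketch only, not necessarily how Statsmodels or Scikit-learn solve it internally), the coefficients can be recovered directly through the pseudoinverse of the design matrix Xp built earlier:
# least-squares solution via the pseudoinverse: coefficients = pinv(Xp) · y
beta_pinv = np.dot(np.linalg.pinv(Xp), y)
print(beta_pinv)  # should match the RM coefficient and the bias estimated before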
One reason the squared errors are preferred as a cost is the following:
• It emphasizes larger differences, because, as the differences are squared, they increase the sum
of the errors proportionally more than a simple sum of absolute values would
import numpy as np
from scipy.optimize import fmin
Let's also define a function returning the cost function as squared differences:
def squared_cost(v,e):
return np.sum((v-e)**2)
Using the fmin minimization procedure offered by the SciPy package, we try to
figure out, for a vector (which will be our x vector of values), the single value e that makes the sum of
squared differences as small as possible:
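The call itself could look like the following sketch (here x is assumed to be the vector of observed target values; fmin prints the iteration report shown below):
x = dataset['target'].values   # the vector of observed values (an assumption for this sketch)
best_e = fmin(squared_cost, x0=np.array([0.0]), args=(x,), disp=True)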
Iterations: 44
Function evaluations: 88
We just output our best e value and verify if it actually is the mean of the x vector:
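Continuing the sketch above, the comparison could be as simple as:
print(best_e, np.mean(x))  # the optimum found by fmin coincides with the arithmetic mean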
If instead we try to figure out what minimizes the sum of absolute errors:
def absolute_cost(v, e):
    return np.sum(np.abs(v - e))
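Repeating the sketch with the absolute cost function shows that the optimum shifts from the mean toward the median of x:
best_e_abs = fmin(absolute_cost, x0=np.array([0.0]), args=(x,), disp=True)
print(best_e_abs, np.median(x))  # minimizing absolute errors leads to the median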
Iterations: 44
Function evaluations: 88