UNIT I Notes-1
The family of linear models is so named because the function that specifies the relationship
between the X, the predictors, and the y, the target, is a linear combination of the X values.
A linear model is simply a smarter form of a summation.
The predictors should tell us something; they should give us some hint about the answer variable;
otherwise any machine learning algorithm won't work properly. We can predict our response
because the information about the answer is already somewhere inside the features, maybe
scattered, twisted, or transformed, but it is just there. Machine learning just gathers and
reconstructs such information.
In statistics, the linear model family is called the generalized linear model (GLM).
A linear model using just a single predictor variable is called simple linear regression.
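To make the idea of a weighted summation concrete, here is a minimal sketch (the numbers are made up for illustration and are not taken from any dataset used in these notes):
import numpy as np
x = np.array([2.0, 5.0, 1.0])      # values of three predictors for one observation
w = np.array([0.5, 1.2, -0.3])     # one weight (coefficient) per predictor
bias = 4.0                         # the constant term
prediction = np.sum(w * x) + bias  # the weighted summation: 0.5*2 + 1.2*5 - 0.3*1 + 4
print(prediction)                  # 10.7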
The simplest possible predictor is the statistical mean. In fact, you can simply guess by always using the same constant
number, and the mean serves such a role very well because it is a powerful descriptive number for data
summary. The mean works very well with normally distributed data but often it is quite suitable even for
different distributions. A normally distributed curve is a distribution of data that is symmetric and has
certain characteristics regarding its shape.
The key to understanding if a distribution is normal is the probability density function (PDF), a function
describing the probability of values in the distribution. In the case of a normal distribution, the PDF is as
follows:

f(x) = (1 / (σ √(2π))) · exp( −(x − µ)² / (2σ²) )

In such a formulation, the symbol µ represents the mean (which coincides with the median and the
mode) and the symbol σ is the standard deviation (its square, σ², is the variance). Based on different means and standard deviations, we can calculate
different value distributions, as the following code demonstrates and visualizes:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean, standard_deviation = 0, 1    # parameters of the normal distribution
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, mean, standard_deviation))
plt.show()
mean_expected_value = dataset['target'].mean()
or
np.mean(dataset['target'])
Statistics suggests that, to measure the difference between the prediction and the real value, we should
square the differences and then sum them all. This is called the sum of squared errors (SSE):
Squared_errors = (dataset['target'] - mean_expected_value)**2
SSE = np.sum(Squared_errors)
density_plot = Squared_errors.plot(kind='hist')
plt.show()
The plot shows how frequent certain errors are with respect to their values. You will immediately
notice that most errors are around zero (there is a high density around that value). This can
be considered a good situation, since in most cases the mean is a good approximation, but some errors
are very far from zero and can attain considerable values.
In the real estate business, we actually know that usually the larger a house is, the more expensive it is;
however, this rule is just part of the story and the price is affected by many other considerations. For the
moment, we will keep it simple and just assume that an extension to a house is a factor that positively
affects the price, and consequently, more space equals more costs when building the house (more land,
more construction materials, more work, and consequently a higher price). Now, we have a variable that
we know should change with our target and we just need to measure it and extend our initial formula
based on constant values with something else. In statistics, there is a measure that captures how
(in the sense of how much and in what direction) two variables relate to each other: correlation.
Computing a correlation involves a few steps.
First, your variables have to be standardized (or your result won't be a correlation but a covariance, a
measure of association that is affected by the scale of the variables you are working with).
In statistical Z score standardization, you subtract from each variable its mean and then you divide the
result by the standard deviation. The resulting transformed variable will have a mean of 0 and a standard
deviation of 1 (or unit variance, since variance is the squared standard deviation). The formula for
standardizing a variable is as follows:

z = (x − µ) / σ
This can be achieved in Python using a simple function:
def standardize(x):
return (x-np.mean(x))/np.std(x)
After standardizing, you compare, for each observation, the difference of each variable from its own mean. If
the two differences agree in sign, their multiplication will be positive (evidence that they have the
same directionality); however, if they differ, the multiplication will turn negative. By summing all the
multiplications between the differences, and dividing them by the number of observations, you
will finally get the correlation, which will be a number ranging from -1 to 1.
The absolute value of the correlation will provide you with the intensity of the relation between
the two variables compared, 1 being a sign of a perfect match and zero a sign of no linear
relation between them. The sign instead will hint at the
proportionality; positive is direct (when one grows the other does the same), negative is indirect (when
one grows, the other shrinks).
observations = float(len(dataset['RM']))
def standardize(variable):
    return (variable - np.mean(variable)) / np.std(variable)
correlation = np.sum(standardize(dataset['RM']) * standardize(dataset['target'])) / observations
Our correlation estimation for the relation between the value of the target variable and the
average number of rooms in houses in the area is 0.695, which is positive and remarkably strong, since
the maximum positive score of a correlation is 1.0.
Let's graph what happens when we correlate two variables. Using a scatterplot, we can easily
visualize the two involved variables. A scatterplot is a graph where the values of two variables are treated
as Cartesian coordinates; thus, for every (x, y) value a point is represented in the graph:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatter_plot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
meanY = scatter_plot.plot(x_range, [dataset['target'].mean()]*2, '--', color='red', linewidth=1)
meanX = scatter_plot.plot([dataset['RM'].mean()]*2, y_range, '--', color='red', linewidth=1)
plt.show()
A linear regression is a line: in bi-dimensional space (x, y), it takes the form of the classical formula of
a line in a Cartesian plane, y = mx + q, where m is the angular coefficient (expressing the slope of the line
with respect to the x axis) and q is the intercept, the point where the line crosses the y axis. Formally, machine learning
indicates the corresponding expression for a linear regression as follows:

y = βX + bias

Here, β plays the role of m (the coefficient of the predictor X) and the bias plays the role of q (the constant term).
Statsmodels offers two implementations of this model:
• statsmodels.api: This works with distinct predictor and answer variables and requires you to apply any
transformation of the variables to the predictor matrix yourself, including adding the intercept
• statsmodels.formula.api: This works in a similar way to R, allowing you to specify a functional form (the
formula of the summation of the predictors)
As a first step, let's import both Statsmodels modules, naming them as conventionally indicated in
the package documentation:
import statsmodels.api as sm
import statsmodels.formula.api as smf
y = dataset['target']
X = dataset['RM']
X = sm.add_constant(X)
The X variable needs to be extended by a constant term (a column of ones); the bias will be calculated accordingly. In fact,
as you remember, with this extension the formula of a linear regression can be written compactly as follows:

y = Xβ
This can be interpreted as a combination of the variables in X, multiplied by its corresponding β value.
Consequently, the predictor X now contains both the predictive variable and a unit constant. Also, β is no
longer a single coefficient, but a vector of coefficients. Let's have a visual confirmation of this by requiring
the first values of the Pandas DataFrame using the head method:
X.head()
At this point, we just need to set the initialization of the linear regression calculation:
linear_regression = sm.OLS(y,X)
Also, we need to ask for the estimation of the regression coefficients, the β vector:
fitted_model = linear_regression.fit()
Alternatively, the same model can be specified and fitted using the formula API:
linear_regression = smf.ols(formula='target ~ RM', data=dataset)
fitted_model = linear_regression.fit()
The previous two code lines comprise both steps seen before, without
requiring any particular variable preparation, since the bias is automatically incorporated. In fact,
the specification about how the linear regression should work is incorporated into the string
target ~ RM, where the variable name left of the tilde (~) indicates the answer variable, the
variable name (or names, in the case of a multiple regression analysis) on the right being for the
predictor. Actually, smf.ols expects quite a different input compared to sm.OLS, because it can
accept our entire original dataset (it selects what variables are to be used by using the
provided formula), whereas sm.OLS expects a matrix containing just the features to be used
for prediction.
A summary (a method of the fitted model) can quickly tell you everything that you need to know
about the regression analysis. In case you have tried statsmodels.formula.api, we also re-initialize
the linear regression using statsmodels.api, since the two APIs do not work on the same X and our
following code relies on the sm.OLS specification:
linear_regression = sm.OLS(y,X)
fitted_model = linear_regression.fit()
fitted_model.summary()
We first need to extract two elements from the fitted model: the coefficients and the predictions
calculated on the data on which we built the model.
print (fitted_model.params)
betas = np.array(fitted_model.params)
fitted_values = fitted_model.predict(X)
• Dep. Variable: It just reminds you what the target variable was
• Model: Another reminder of the model that you have fitted, the OLS is ordinary least squares,
another way to refer to linear regression
• Method: The parameters fitting method (in this case least squares, the classical computation
method)
• DF Residuals: The degrees of freedom of the residuals, which is the number of observations
minus the number of parameters
• DF Model: The number of estimated parameters in the model (excluding the constant term
from the count)
The second table gives a more interesting picture, focusing on how good
the fit of the linear regression model is and pointing out any possible
problems with the model:
• R-squared: This is the coefficient of determination, a measure of how well the regression
does with respect to a simple mean.
• Adj. R-squared: This is the coefficient of determination adjusted for the number of
parameters in the model and the number of observations that helped build it.
• F-statistic: This is a measure telling you whether, from a statistical point of view, all your
coefficients, apart from the bias and taken together, are different from zero. In simple words, it
tells you whether your regression is really better than a simple average.
• Prob (F-statistic): This is the probability that you got that F-statistic just by chance, given
the observations that you have used (such a probability is actually called the p-value of the F-
statistic). If it is low enough, you can be confident that your regression is really better than a
simple mean. Usually, in statistics and science, a test probability has to be equal to or lower than
0.05 (a conventional criterion of statistical significance) to grant such confidence.
• AIC: This is the Akaike Information Criterion, a score that evaluates the model based on
the number of observations and the complexity of the model itself. The lower the AIC score,
the better. It is very useful for comparing different models and for statistical variable
selection.
• BIC: This is the Bayesian Information Criterion. It works like AIC, but it imposes a higher
penalty on models with more parameters.
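These figures are not only printed in the summary; the fitted statsmodels results object also exposes them as attributes, as this short sketch shows:
print(fitted_model.rsquared, fitted_model.rsquared_adj)  # R-squared and adjusted R-squared
print(fitted_model.fvalue, fitted_model.f_pvalue)        # F-statistic and its p-value
print(fitted_model.aic, fitted_model.bic)                # information criteria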
R-squared is much more interesting because it tells you how much better your regression model is in
comparison to a single mean. It does so by reporting the proportion of the variance that a mean used as
a predictor leaves unexplained but that your model was actually able to explain.
from scipy.stats import pearsonr
# in simple linear regression, R-squared equals the squared correlation between predictor and target
(pearsonr(dataset['RM'], dataset['target'])[0])**2
Out: 0.4835254559913339
R-squared is perfectly aligned with the squared errors that the linear regression is trying to minimize;
thus, a better R-squared means a better model.
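To make the link between R-squared and the squared errors explicit, here is a minimal sketch that recomputes R-squared as one minus the ratio between the model's sum of squared errors and the sum of squared errors of the simple mean; the result should match fitted_model.rsquared:
SSE_model = np.sum((dataset['target'] - fitted_values)**2)            # errors of the regression
SSE_mean = np.sum((dataset['target'] - dataset['target'].mean())**2)  # errors of the simple mean
print(1 - SSE_model / SSE_mean)   # manually computed R-squared
print(fitted_model.rsquared)      # the value reported by Statsmodels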
The third table of the summary reports the estimated coefficients, each presented together with a
series of tests. These tests can make us confident that we have not been fooled by a few
anomalous observations:
• std err: The standard error of the estimate of the coefficient; the larger it is, the more
uncertain the estimation of the coefficient
• t: The t-statistic value, a measure indicating whether the coefficient's true value is different from
zero
• P > |t|: The p-value, the probability of obtaining a coefficient at least this far from zero just by
chance if its true value were zero
• [95.0% Conf. Interval]: The lower and upper bounds of the coefficient's 95% confidence interval,
accounting for the variability you would observe with different samples and thus different estimated coefficients.
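All these per-coefficient figures can also be extracted programmatically from the fitted model; a short sketch:
print(fitted_model.params)      # estimated coefficients (bias and RM)
print(fitted_model.bse)         # standard errors of the coefficients
print(fitted_model.tvalues)     # t-statistics
print(fitted_model.pvalues)     # p-values
print(fitted_model.conf_int())  # 95% confidence intervals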
The coefficients are the most important output that we can obtain from our regression model because
they allow us to re-create the weighted summation that can predict our outcomes. In our example, our
coefficients are −34.6706 for the bias (also called the intercept, recalling the formula for a line in a
Cartesian space) and 9.1021 for the RM variable. Recalling our formula, we can plug in the numbers we
obtained:

y = −34.6706 + 9.1021 * RM

Now, replacing RM with a value of interest, say 4.55 average rooms, the computation becomes the following:
9.1021*4.55-34.6706
Out: 6.743955
A linear regression can always work within the range of values it learned from (this is called interpolation)
but can provide correct values beyond its learning boundaries (a different predictive activity called
extrapolation) only under certain conditions.
The residuals are the difference between the target values and the predicted (fitted) values. The last part of the
summary reports a few statistics describing their distribution:
• Skewness: This is a measure of the symmetry of the residuals around the mean. For symmetric
distributed residuals, the value should be around zero. A positive value indicates a long tail to
the right; a negative value a long tail to the left.
• Kurtosis: This is a measure of the shape of the distribution of the residuals. A bell-shaped
distribution has a zero measure. A negative value points to a too flat distribution; a positive
one has too great a peak.
• Omnibus D'Agostino's test: This is a combined statistical test for skewness and kurtosis.
• Durbin-Watson: This is a test for the presence of correlation among the residuals (relevant
during analysis of time-based data).
When inspecting the residuals, it is important to keep an eye out for any of these three problems showing up (a plotting sketch follows the list):
1. Values too far from the average. Large standardized residuals hint at a serious difficulty when modeling
such observations. Also, in the process of learning these values, the regression coefficients may have
been distorted.
2. Different variance with respect to the value of the predictor. If the linear regression is an average
conditioned on the predictor, non-homogeneous variance points out that the regression is not working
properly when the predictor takes certain values.
3. Strange shapes in the cloud of residual points may indicate that you need a more complex model for
the data you are analyzing.
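A practical way to screen for all three problems is to plot the standardized residuals against the predictor and look for extreme points, funnel shapes, or curved patterns. The following is a minimal sketch, reusing the residuals stored in the fitted Statsmodels model:
residuals = fitted_model.resid                                    # target values minus fitted values
std_residuals = (residuals - residuals.mean()) / residuals.std()  # standardized residuals
plt.scatter(dataset['RM'], std_residuals)
plt.axhline(y=0, color='red')   # residuals should scatter evenly around zero
plt.xlabel('RM')
plt.ylabel('standardized residuals')
plt.show()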
There are six different reasons why a predictor may relate to your target, and a cautionary word about each
will help you handle such predictors without difficulty:
• Direct causation: x causes y; for instance, in the real estate business the value is directly
proportional to the size of the house in square meters.
• Reciprocal effects: x causes y but it is also influenced by y. This is quite typical of many macro-
economic dynamics where the effect of a policy augments or diminishes its effects. As an
example in real estate, high crime rates in an area can lower its prices but lower prices mean
that the area could quickly become even more degraded and dangerous.
• Spurious causation: This happens when the real cause is actually z, which causes both x and y;
consequently it is just a fallacious illusion that x implies y because it is z behind the scenes. For
instance, the presence of expensive art shops and galleries may seem to correlate with house
prices; in reality, both are determined by the presence of affluent residents.
• Indirect causation: x in reality is not causing y but it is causing something else, which then
causes y. A good municipality investing in infrastructures after higher taxes can indirectly affect
house prices because the area becomes more comfortable to live in, thus attracting more
demand. Higher taxes, and thus more investments, indirectly affect house prices.
• Conditional effect: x causes y in respect of the values of another variable z; for instance, when
z has certain values x is not influencing y but, when z takes particular values, the x starts
impacting y. We also call this situation interaction. For instance the presence of schools in an
area can become an attractor when the crime rate is low, so it affects house prices only when
there is little criminality.
• Random effect: Any recorded correlation between x and y has been due to a lucky sampling
selection; in reality there is no relationship with y at all.
The ideal case is when you have a direct causation; then, you will have a predictor in your model that will
always provide you with the best values to derive your responses. In the other cases, it is likely that the
imperfect cause-effect relationship with the target variable will lead to more noisy estimates, especially in
production when you will have to work with data not seen before by the model. Reciprocal effects are
more typical of econometric models. They require special types of regression analysis. Including them in
your regression analysis may improve your model; however, their role may be underestimated. Spurious
and indirect causes will add some noise to your x and y relationship; this could bring noisier estimates
(larger standard errors). Often, the solution is to get more observations for your analysis. Conditional
effects, if not caught, can limit your model's ability to produce accurate estimates. If you are not aware of
any of them, given your domain knowledge of the problem, it is a good step to check for any of them
using some automatic procedure to test possible interactions between the variables. Random effects are
the worst possible thing that could happen to your model, since there is no real relationship to learn and any
apparent predictive power will vanish on new data.
To obtain a prediction from the fitted model, we prepare an observation vector (with the unit constant in first position, as in X) and call the predict method:
RM = 5  # an illustrative value for the average number of rooms
Xp = np.array([1, RM])
print("Our model predicts if RM = %0.1f the answer value is %0.1f" % (RM, fitted_model.predict(Xp)[0]))
A nice usage of the predict method is to project the fitted predictions on our previous scatterplot to allow
us to visualize the price dynamics in respect of our predictor, the average number of rooms:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatter_plot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
regression_line = scatter_plot.plot(dataset['RM'], fitted_values, '-', color='red', linewidth=1)
plt.show()
Using the previous code snippet, we obtain the preceding graphical representation of the regression
line, together with an indication of how such a line crosses the cloud of data points. For instance,
thanks to this graphical display, we can notice that the regression line passes exactly through the point
where the x and y averages meet.
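As a quick numerical check of this last remark (a minimal sketch reusing the fitted model), predicting at the mean of RM should return the mean of the target, up to floating-point precision:
print(fitted_model.predict(np.array([1, dataset['RM'].mean()]))[0])  # prediction at the average RM
print(dataset['target'].mean())                                      # average of the target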
Besides the predict method, generating the predictions is quite easy by just using the dot function in
NumPy. After preparing an X matrix containing both the variable data and the bias (a column of ones) and
the coefficient vectors, all you have to do is to multiply the matrix by the vector. The result will itself be a
vector of length equal to the number of observations:
predictions_by_dot_product = np.dot(X,betas)
from sklearn import linear_model  # note: the normalize argument has been removed in recent scikit-learn releases
linear_regression = linear_model.LinearRegression(fit_intercept=True)
Data preparation, instead, requires counting the observations and carefully preparing the predictor array
to specify its two dimensions (if left as a vector, the fitting procedure will raise an error):
observations = len(dataset)
X = dataset['RM'].values.reshape((observations,1))
After completing all the previous steps, we can fit the model using the fit method:
linear_regression.fit(X,y)
A very convenient feature of the Scikit-learn package is that all the models, no matter their type of
complexity, share the same methods. The fit method is always used for fitting and it expects an X and a y
(when the model is a supervised one). Instead, the two common methods for making an exact prediction
(always for regression) and its probability (when the model is probabilistic) are predict and predict_proba,
respectively.
After fitting the model, we can inspect the vector of the coefficients and the bias constant:
print (linear_regression.coef_)
print (linear_regression.intercept_)
Using the predict method and slicing the first 10 elements of the resulting list, we output the first 10
predicted values:
print (linear_regression.predict(X)[:10])
If we prepare a new matrix and we add a constant, we can calculate the results by ourselves using a
simple matrix–vector multiplication:
Xp = np.column_stack((X, np.ones(observations)))
v_coef = list(linear_regression.coef_) + [linear_regression.intercept_]
As expected, the result of the product provides us with the same estimates as the predict method:
np.dot(Xp,v_coef)[:10]
The following is a simple test: generate a large dataset and check the performance of the two versions of
linear regression on it.
After generating ten million observations of a single variable (held in HX and Hy), we measure each fit using the %%time magic
function for IPython. This magic function automatically computes how long it takes to complete the
calculations in the IPython cell:
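The generation step is not shown here; a possible sketch, assuming scikit-learn's make_regression helper is used, is the following:
from sklearn.datasets import make_regression
# ten million observations of a single predictor variable and one target
HX, Hy = make_regression(n_samples=10000000, n_features=1, n_targets=1, random_state=101)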
%%time
linear_regression = linear_model.LinearRegression(fit_intercept=True)  # Scikit-learn version
linear_regression.fit(HX, Hy)

%%time
sm_linear_regression = sm.OLS(Hy, sm.add_constant(HX))  # Statsmodels version
sm_linear_regression.fit()
There are quite a few methods to minimize the cost function (the sum of squared errors), some performing better than others in the presence of
large quantities of data. Among the better performers, the most important ones are the pseudoinverse, QR
factorization, and gradient descent.
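As an illustration of the first approach (a sketch only, not necessarily how Statsmodels or Scikit-learn solve it internally), the coefficients can be recovered directly through the pseudoinverse of the design matrix Xp built earlier:
# least-squares solution via the pseudoinverse: coefficients = pinv(Xp) · y
beta_pinv = np.dot(np.linalg.pinv(Xp), y)
print(beta_pinv)  # should match the RM coefficient and the bias estimated before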
One reason the squared errors are preferred as a cost is the following:
• It emphasizes larger differences, because, as the differences are squared, they increase the sum
of the errors proportionally more than a simple sum of absolute values would
import numpy as np
from scipy.optimize import fmin
Let's also define a function returning the cost function as squared differences:
def squared_cost(v,e):
return np.sum((v-e)**2)
Using the fmin minimization procedure offered by the SciPy package, we try to
figure out, for a vector (which will be our x vector of values), the single value e that makes the sum of
squared differences as small as possible:
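The call itself could look like the following sketch (here x is assumed to be the vector of observed target values; fmin prints the iteration report shown below):
x = dataset['target'].values   # the vector of observed values (an assumption for this sketch)
best_e = fmin(squared_cost, x0=np.array([0.0]), args=(x,), disp=True)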
Iterations: 44
Function evaluations: 88
We just output our best e value and verify if it actually is the mean of the x vector:
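Continuing the sketch above, the comparison could be as simple as:
print(best_e, np.mean(x))  # the optimum found by fmin coincides with the arithmetic mean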
If instead we try to figure out what minimizes the sum of absolute errors:
def absolute_cost(v, e):
    return np.sum(np.abs(v - e))
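Repeating the sketch with the absolute cost function shows that the optimum shifts from the mean toward the median of x:
best_e_abs = fmin(absolute_cost, x0=np.array([0.0]), args=(x,), disp=True)
print(best_e_abs, np.median(x))  # minimizing absolute errors leads to the median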
Iterations: 44
Function evaluations: 88