Simple and Multiple Linear Regression

Linear Regression
Regression analysis is used to create a model that describes the relationship between a dependent variable and one or more independent variables. Depending on whether one or several independent variables are included, a distinction is made between simple and multiple linear regression analysis.

In a simple linear regression, the aim is to examine the influence of one independent variable on a dependent variable. In a multiple linear regression, the influence of several independent variables on one dependent variable is analyzed.

An important prerequisite of linear regression is that the dependent variable is measured on a metric (interval or ratio) scale and is normally distributed. If the dependent variable is categorical, logistic regression is used instead. You can easily perform a regression analysis in the linear regression calculator here on DATAtab.

Example: Simple Linear Regression


Does height have an influence on a person's weight?

Example: Multiple Linear Regression


Do height and gender have an influence on a person's weight?

(In these examples, weight is the dependent variable; height and gender are the independent variables.)

Simple Linear Regression


The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The greater the linear relationship between the independent and the dependent variable, the more accurate the prediction. This goes along with the fact that the greater the proportion of the dependent variable's variance that can be explained by the independent variable, the more accurate the prediction. Visually, the relationship between the variables can be shown in a scatter plot: the greater the linear relationship between the dependent and the independent variable, the more closely the data points lie on a straight line.

The task of simple linear regression is now to determine exactly the straight line that best describes the linear relationship between the dependent and the independent variable. In linear regression analysis, this straight line is drawn in the scatter plot. To determine it, linear regression uses the method of least squares.

The regression line can be described by the following equation:

ŷ = a + b · x

Definition of "Regression coefficients":

a : the point of intersection with the y-axis


b : the gradient of the straight line

ŷ is the respective estimate of the y-value. This means that for each x-value the corresponding y-value is estimated. In our example, this means that the height of people is used to estimate their weight.

If all points (measured values) lay exactly on one straight line, the estimate would be perfect. However, this is almost never the case, and therefore a straight line must usually be found that is as close as possible to the individual data points. The aim is thus to keep the estimation error as small as possible, so that the distance between the estimated value and the true value is minimal. This distance or error is called the "residual" and is abbreviated as "e" (error).

When calculating the regression line, the regression coefficients (a and b) are determined so that the sum of the squared residuals is minimal (OLS, "ordinary least squares").
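
A minimal sketch in Python (using numpy and the first six height/weight pairs from the example data set shown further below) of how a and b are obtained by ordinary least squares:

```python
import numpy as np

# Example data: height in m (x) and weight in kg (y)
x = np.array([1.80, 1.68, 1.82, 1.70, 1.87, 1.55])
y = np.array([79.0, 69.0, 73.0, 95.0, 82.0, 55.0])

# OLS estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # estimated y-values
e = y - y_hat            # residuals

print(f"a = {a:.3f}, b = {b:.3f}")
print("sum of squared residuals:", np.sum(e ** 2))
```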

The regression coefficient b can have different signs, which are interpreted as follows:

b > 0: there is a positive correlation between x and y (the greater x, the greater y)
b < 0: there is a negative correlation between x and y (the greater x, the smaller y)
b = 0: there is no correlation between x and y

Standardized regression coefficients are usually designated by the letter "beta". These are values that are comparable with each
other. Here the unit of measurement of the variable is no longer important. The standardized regression coefficient (beta) is
automatically output by DATAtab.

Coefficient of determination and standard estimation error


In order to find out how well the regression model can predict or explain the dependent variable, two main measures are used. This
is on the one hand the coefficient of determination R2 and on the other hand the standard estimation error. The coefficient of
determination R2, also known as the variance explanation, indicates how large the portion of the variance is that can be explained by
the independent variables. The more variance can be explained, the better the regression model is. In order to calculate R2, the
variance of the estimated value is related to the variance in the observed values:
R² = variance of the predicted values / variance of the observed values

The standard estimation error is the standard deviation of the estimation error. This gives an impression of how much the prediction
differs from the correct value. Graphically interpreted, the standard estimation error is the dispersion of the observed values around
the regression line.
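
Both measures can be computed directly from the predictions; a sketch continuing the numpy example above (dividing by n − 2 to account for the two estimated parameters is one common convention for the standard estimation error):

```python
import numpy as np

# Coefficient of determination: variance of the predicted values
# divided by the variance of the observed values
r_squared = np.var(y_hat) / np.var(y)

# Standard estimation error: standard deviation of the residuals e
n = len(y)
std_est_error = np.sqrt(np.sum(e ** 2) / (n - 2))

print(f"R² = {r_squared:.3f}, standard estimation error = {std_est_error:.3f}")
```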

The coefficient of determination and the standard estimation error are used for simple and multiple linear regression.

Multiple Linear Regression


Unlike simple linear regression, multiple linear regression allows more than one independent variable to be considered. The goal is
to estimate a variable based on several other variables. The variable to be estimated is called the dependent variable (criterion). The
variables that are used for the prediction are called independent variables (predictors).

Multiple linear regression is frequently used in empirical social research as well as in market research. In both areas it is of interest to
find out what influence different factors have on a variable. For example, what determinants influence a person's health or purchasing
behavior?

Marketing example:
Suppose that for a video streaming service you want to predict how many times a month a person streams videos. For this you get a record of past visitor data (age, income, gender, ...).

Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For this purpose, you analyze a patient
data set with cholesterol level, age, hours of sport per week and so on.

With k independent variables, the equation for the multiple regression is obtained as:

ŷ = a + b1 · x1 + b2 · x2 + … + bk · xk

The coefficients can now be interpreted similarly to the linear regression equation. If all independent variables are 0, the resulting
value is a. If an independent variable changes by one unit, the associated coefficient indicates by how much the dependent variable
changes. So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
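
As a sketch of how such a model can be fitted, here is a small numpy example with two predictors (height and age, taken from the first five rows of the example data set further below):

```python
import numpy as np

# Predictors: height (m) and age (years); target: weight (kg)
X = np.array([
    [1.80, 35.0],
    [1.68, 39.0],
    [1.82, 25.0],
    [1.70, 60.0],
    [1.87, 27.0],
])
y = np.array([79.0, 69.0, 73.0, 95.0, 82.0])

# Prepend a column of ones so that the first coefficient is the intercept a
X1 = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)

a, b1, b2 = coeffs
print(f"a = {a:.3f}, b_height = {b1:.3f}, b_age = {b2:.3f}")
# If age increases by one year (height held constant),
# the predicted weight changes by b_age kg.
```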

Multivariate Regression vs. Multiple Regression


Multiple regression should not be confused with multivariate regression. In the former case, the influence of several independent variables on one dependent variable is examined. In the latter case, several regression models are calculated to allow conclusions to be drawn about several dependent variables. Consequently, in a multiple regression one dependent variable is taken into account, whereas in a multivariate regression several dependent variables are analyzed.

Assumptions of Linear Regression


In order to interpret the results of the regression analysis meaningfully, certain conditions must be met.

Linearity: There must be a linear relationship between the dependent and the independent variables.
Homoscedasticity: The residuals must have a constant variance.
Normality: The residuals (errors) must be normally distributed.
No multicollinearity: There must be no high correlation between the independent variables.

Linearity
In linear regression, a straight line is drawn through the data. This straight line should represent all points as well as possible. If the points are distributed in a non-linear way, the straight line cannot fulfill this task.

If there is a linear relationship between the dependent and the independent variable, the regression line can be fitted meaningfully. If there is a clearly non-linear relationship between the dependent and the independent variable, it is not possible to put the regression line through the points in a meaningful way: the coefficients of the regression model cannot be interpreted meaningfully, and the prediction errors may be larger than expected.

It is therefore important to check beforehand whether a linear relationship exists between the dependent variable and each of the independent variables. This is usually checked graphically.
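
A minimal sketch of this graphical check with matplotlib (using the same height/weight pairs as in the simple-regression sketch above):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.80, 1.68, 1.82, 1.70, 1.87, 1.55])  # height (m)
y = np.array([79.0, 69.0, 73.0, 95.0, 82.0, 55.0])  # weight (kg)

# Fit a line and overlay it on the scatter plot; a clear curve or
# systematic pattern around the line would indicate non-linearity
slope, intercept = np.polyfit(x, y, 1)
plt.scatter(x, y, label="observed values")
plt.plot(x, intercept + slope * x, color="red", label="regression line")
plt.xlabel("height (m)")
plt.ylabel("weight (kg)")
plt.legend()
plt.show()
```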

Homoscedasticity
Since in practice the regression model never predicts the dependent variable exactly, there is always an error. This error must have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the dependent variable is plotted on the x-axis and the error on the y-axis. The error should then scatter evenly over the entire range. If this is the case, homoscedasticity is present; if not, heteroscedasticity is present. In the case of heteroscedasticity, the error has different variances depending on the value range of the dependent variable.
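
A sketch of such a residual plot, continuing the linearity sketch above (here the residuals are plotted against the predicted values of the dependent variable, a common variant of this plot):

```python
import matplotlib.pyplot as plt

# Predictions and residuals from the fitted line above
y_hat = intercept + slope * x
e = y - y_hat

# The residuals should scatter evenly around zero over the whole range;
# a funnel shape would indicate heteroscedasticity
plt.scatter(y_hat, e)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.show()
```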

Normal distribution of the error


The next requirement of linear regression is that the error epsilon must be normally distributed. There are two ways to check this: an analytical one and a graphical one. Analytically, you can use either the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p-value is greater than 0.05, no significant deviation of the data from the normal distribution can be shown, and one can assume that the data are normally distributed.

However, these analytical tests are used less and less because they tend to attest normal distribution for small samples and become
significant very quickly for large samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore, the
graphical variant is increasingly used.

In the graphical variant, either the histogram of the residuals is inspected or, even better, the so-called QQ plot (quantile-quantile plot). The more closely the data lie on the line, the better the agreement with the normal distribution.
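
A sketch of both checks with scipy, applied to the residuals e computed above:

```python
from scipy import stats
import matplotlib.pyplot as plt

# Analytical check: Shapiro-Wilk test on the residuals
stat, p = stats.shapiro(e)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.3f}")  # p > 0.05: no significant deviation

# Graphical check: QQ plot of the residuals against the normal distribution
stats.probplot(e, dist="norm", plot=plt)
plt.show()
```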

Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with
multicollinearity is that the effects of each independent variable cannot be clearly separated from one another.

If, for example, there is a high correlation between x1 and x2, it is difficult to determine b1 and b2. If, say, both variables are completely identical, the regression model cannot tell how large b1 and how large b2 should be, and the model therefore becomes unstable.
This is not a problem if the regression model is only used for prediction, since in that case one is only interested in the prediction itself, not in how great the influence of the respective variables is. However, if the regression model is used to measure the influence of the independent variables on the dependent variable, there must be no multicollinearity. If multicollinearity exists, the coefficients cannot be interpreted meaningfully.
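
A simple check that matches this correlation-based description is the correlation matrix of the independent variables; a sketch with numpy, reusing the predictor matrix X from the multiple-regression sketch above (variance inflation factors would be a more formal alternative):

```python
import numpy as np

# Pairwise correlations between the independent variables (columns of X);
# off-diagonal values close to +1 or -1 indicate multicollinearity
corr = np.corrcoef(X, rowvar=False)
print(corr)
```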


Significance test and Regression


Regression analysis is often carried out in order to make statements about a population based on a sample, so the regression coefficients are calculated using the data from the sample. To rule out the possibility that the regression coefficients are just random and would take completely different values in another sample, the results are tested statistically with a significance test. This test takes place at two levels:

Significance test for the whole regression model


Significance test for the regression coefficients

It should be noted, however, that the requirements in the previous section must be met.

Significance test for the regression model


Here it is checked whether the coefficient of determination R2 in the population differs from zero. The null hypothesis is therefore that the coefficient of determination R2 in the population is zero. To test the null hypothesis, the following F-statistic is calculated:

F = (R² / k) / ((1 − R²) / (n − k − 1))

The calculated F-value must now be compared with the critical F-value. If the calculated F-value is greater than the critical F-value, the null hypothesis is rejected and R2 deviates from zero in the population. The critical F-value can be read from the F-distribution table; the numerator degrees of freedom are k and the denominator degrees of freedom are n − k − 1.
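
A sketch of this test with scipy, using R² = 0.754, k = 3 and n = 10 as in the worked example at the end of this article:

```python
from scipy import stats

r_squared, k, n = 0.754, 3, 10   # values from the worked example below

f_value = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
f_critical = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)  # 5% significance level

print(f"F = {f_value:.3f}, critical F = {f_critical:.3f}")
print("reject H0 (R² = 0)" if f_value > f_critical else "cannot reject H0")
```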

Significance test for the regression coefficients


The next step is to check which variables make a significant contribution to the prediction of the dependent variable. This is done by checking whether the slopes (regression coefficients) also differ from zero in the population. For this, the following test statistic is calculated:

t = bj / s_bj

where bj is the j-th regression coefficient and s_bj is the standard error of bj. This test statistic is t-distributed with n − k − 1 degrees of freedom. The critical t-value can be read from the t-distribution table.
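
A sketch of this coefficient test, continuing the multiple-regression example above (the standard errors s_bj are taken from the diagonal of the estimated covariance matrix of the coefficients):

```python
import numpy as np
from scipy import stats

# X1 is the design matrix (with intercept column), y the dependent
# variable and coeffs the OLS estimates from the earlier sketch
n, p = X1.shape                          # p = k + 1 parameters
e = y - X1 @ coeffs
sigma2 = e @ e / (n - p)                 # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))  # standard errors s_bj

t_values = coeffs / se
p_values = 2 * stats.t.sf(np.abs(t_values), df=n - p)     # two-sided p-values

for name, t, pv in zip(["a", "b1", "b2"], t_values, p_values):
    print(f"{name}: t = {t:.3f}, p = {pv:.3f}")
```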

Example Linear Regression


As an example of linear regression, a model is set up that predicts the body weight of a person. The dependent variable is thus the
body weight, while the height, age and gender are chosen as independent variables. The following example data set is available:

weight (kg)   height (m)   age (years)   gender

79            1.80         35            male
69            1.68         39            male
73            1.82         25            male
95            1.70         60            male
82            1.87         27            male
55            1.55         18            female
69            1.50         89            female
71            1.78         42            female
64            1.67         16            female
69            1.64         52            female

After you have copied your data into the statistics calculator, you must select the variables that are relevant for you. Then you receive
the results in table form.
Interpretation of the results
This table shows that 75.4% of the variation in weight can be explained by height, age and gender. When predicting a person's weight, the model is off by an average of 6.587 kg (the standard estimation error). The regression equation is

Weight = 47.379 · Height + 0.297 · Age + 8.922 · is_male − 24.41

The equation shows, for example, that if the age increases by one year, the weight increases by 0.297 kg according to the model. In the case of the dichotomous variable gender, the slope is to be interpreted as a difference: according to the model, a man weighs 8.922 kg more than a woman. If all independent variables are zero, the result is a weight of −24.41 kg.
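
The same analysis can be reproduced outside DATAtab, for example with Python's statsmodels; a sketch with gender coded as the 0/1 dummy variable is_male used in the equation above (the exact output may differ slightly in rounding):

```python
import pandas as pd
import statsmodels.api as sm

# Example data set from above; gender coded as is_male (1 = male, 0 = female)
df = pd.DataFrame({
    "weight":  [79, 69, 73, 95, 82, 55, 69, 71, 64, 69],
    "height":  [1.80, 1.68, 1.82, 1.70, 1.87, 1.55, 1.50, 1.78, 1.67, 1.64],
    "age":     [35, 39, 25, 60, 27, 18, 89, 42, 16, 52],
    "is_male": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

X = sm.add_constant(df[["height", "age", "is_male"]])
model = sm.OLS(df["weight"], X).fit()
print(model.summary())  # R², F-test, coefficients with t- and p-values
```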

The standardized coefficients beta are reported separately and usually range between −1 and +1. The greater the absolute value of beta, the greater the contribution of the respective independent variable to explaining the dependent variable. In this regression analysis, the variable age has the greatest influence on the variable weight.

The calculated coefficients refer to the sample used in the regression analysis, so it is of interest whether the B-values deviate from zero only by chance or whether they are also different from zero in the population. For this purpose, the null hypothesis is formulated that the respective calculated B-value is equal to zero in the population. If this is the case, the respective independent variable has no significant influence on the dependent variable.

The significance value (p-value) indicates whether a variable has a significant influence. p-values smaller than 0.05 are considered significant. In this example, only age can be considered a significant predictor of a person's weight.
