Simple and Multiple Linear Regression
Linear Regression
Regression analysis is used to create a model that describes the relationship between a dependent variable and one or more
independent variables. Depending on whether there are one or more independent variables, a distinction is made between simple
and multiple linear regression analysis.
In the case of a simple linear regression, the aim is to examine the influence of an independent variable on one dependent variable.
In the second case, a multiple linear regression, the influence of several independent variables on one dependent variable is
analyzed.
In linear regression, an important prerequisite is that the dependent variable has a metric scale of measurement (interval or ratio scale) and is normally distributed. If the dependent variable is categorical, a logistic regression is used instead. You can easily perform a regression analysis in the linear regression calculator here on DATAtab.
Figure: a dependent variable is explained by one or more independent variables.
In linear regression analysis, a straight line is drawn through the points in the scatter plot. The task of simple linear regression is to determine exactly the straight line that best describes the linear relationship between the dependent and the independent variable. To determine this straight line, linear regression uses the method of least squares.
The regression line can be written as ŷ = a + b·x, where a is the intercept, b is the slope (the regression coefficient) and ŷ is the respective estimate of the y-value. This means that for each x-value the corresponding y-value is estimated. In our example, this means that the height of a person is used to estimate his or her weight.
If all points (measured values) lay exactly on one straight line, the estimate would be perfect. However, this is almost never the case, so a straight line must be found that is as close as possible to the individual data points. The aim is thus to keep the error of the estimation as small as possible, so that the distance between the estimated value and the true value is minimal. This distance or error is called the "residual" and is abbreviated as "e" (for error).
When calculating the regression line, an attempt is made to determine the regression coefficients (a and b) so that the sum of the squared residuals is minimal (OLS, "ordinary least squares").
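As a small illustration of the least-squares idea, the following Python sketch fits a regression line with NumPy; the height and weight values are made-up example numbers, not data from DATAtab.

```python
import numpy as np

# Made-up example data: height in cm (independent), weight in kg (dependent)
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

# Ordinary least squares fit of y = a + b*x (np.polyfit returns [slope, intercept])
b, a = np.polyfit(height, weight, deg=1)

y_hat = a + b * height       # estimated y-values
residuals = weight - y_hat   # residuals e = y - y_hat

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```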
The regression coefficient b can have different signs, which are interpreted as follows:
b > 0: there is a positive correlation between x and y (the greater x, the greater y)
b < 0: there is a negative correlation between x and y (the greater x, the smaller y)
b = 0: there is no correlation between x and y
Standardized regression coefficients are usually designated by the letter "beta". These values are comparable with each other because the unit of measurement of the variables no longer plays a role. The standardized regression coefficient (beta) is automatically output by DATAtab.
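For a simple regression, the standardized coefficient can also be obtained by hand as beta = b · sd(x) / sd(y), i.e. the slope after both variables have been z-standardized. A short sketch with the same made-up height/weight values as above:

```python
import numpy as np

# Made-up example data (height in cm, weight in kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

b, a = np.polyfit(x, y, deg=1)

# Standardized coefficient: slope expressed in standard deviations instead of raw units
beta = b * np.std(x, ddof=1) / np.std(y, ddof=1)
print(f"b = {b:.3f} kg per cm, beta = {beta:.3f}")
```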
The standard estimation error is the standard deviation of the estimation error. This gives an impression of how much the prediction
differs from the correct value. Graphically interpreted, the standard estimation error is the dispersion of the observed values around
the regression line.
The coefficient of determination (R²) and the standard estimation error are used as quality measures for both simple and multiple linear regression.
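Both measures can be computed directly from the residuals. The sketch below uses the made-up data from above and takes the standard estimation error as sqrt(SSE / (n - k - 1)) with k = 1 predictor; conventions differ slightly between programs.

```python
import numpy as np

x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 61.0, 66.0, 72.0, 76.0, 84.0])

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

n, k = len(y), 1                      # sample size, number of predictors
sse = np.sum(residuals ** 2)          # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares

r_squared = 1 - sse / sst             # coefficient of determination R^2
see = np.sqrt(sse / (n - k - 1))      # standard estimation error

print(f"R^2 = {r_squared:.3f}, standard estimation error = {see:.3f}")
```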
Multiple linear regression is frequently used in empirical social research as well as in market research. In both areas it is of interest to
find out what influence different factors have on a variable. For example, what determinants influence a person's health or purchasing
behavior?
Marketing example:
For a video streaming service, you are asked to predict how many times a month a person streams videos. For this you receive a data set of past visitor data (age, income, gender, ...).
Medical example:
You want to find out which factors have an influence on the cholesterol level of patients. For this purpose, you analyze a patient
data set with cholesterol level, age, hours of sport per week and so on.
With k independent variables, the equation needed to calculate a multiple regression is ŷ = a + b1·x1 + b2·x2 + ... + bk·xk.
The coefficients can now be interpreted similarly to those of the simple linear regression equation. If all independent variables are 0, the resulting value is a. If an independent variable changes by one unit while all other variables are held constant, the associated coefficient indicates by how much the dependent variable changes. So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
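A minimal sketch of such a multiple regression fit, here with NumPy's least-squares solver; the height, age, sex and weight values are invented purely for illustration (sex coded 0 = female, 1 = male).

```python
import numpy as np

# Made-up data: weight explained by height, age and sex
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 172.0, 168.0])
age    = np.array([ 25.0,  34.0,  41.0,  29.0,  52.0,  38.0,  45.0,  31.0])
sex    = np.array([  0.0,   0.0,   1.0,   0.0,   1.0,   1.0,   1.0,   0.0])
weight = np.array([ 54.0,  60.0,  72.0,  63.0,  85.0,  80.0,  77.0,  59.0])

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(weight), height, age, sex])

# Least-squares solution: coefficients = [a, b1 (height), b2 (age), b3 (sex)]
coeffs, *_ = np.linalg.lstsq(X, weight, rcond=None)
a, b1, b2, b3 = coeffs
print(f"a = {a:.3f}, b_height = {b1:.3f}, b_age = {b2:.3f}, b_sex = {b3:.3f}")
```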
Linear regression is based on the following assumptions:
Linearity: There must be a linear relationship between the dependent and the independent variables.
Homoscedasticity: The residuals must have a constant variance.
Normality: The error (the residuals) must be normally distributed.
No multicollinearity: There must be no high correlation between the independent variables.
Linearity
In linear regression, a straight line is drawn through the data. This straight line should represent all points as well as possible. If the points are distributed in a non-linear way, the straight line cannot fulfill this task.
In the left graph, there is a linear relationship between the dependent and the independent variable, so the regression line can be drawn in a meaningful way. In the right graph, there is a clearly non-linear relationship between the dependent and the independent variable. Here it is not possible to put the regression line through the points in a meaningful way: the coefficients of the regression model cannot be interpreted meaningfully, and the prediction errors may be larger than expected.
Therefore it is important to check beforehand whether a linear relationship exists between the dependent variable and each of the independent variables. This is usually checked graphically.
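One simple way to do this graphical check is a scatter plot per independent variable. The sketch below simulates one roughly linear and one clearly curved relationship so that the difference described above becomes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_linear = 2.0 * x + rng.normal(0, 1.5, size=x.size)       # roughly linear relationship
y_curved = 0.5 * x ** 2 + rng.normal(0, 1.5, size=x.size)  # clearly non-linear relationship

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y_linear)
ax1.set_title("linear: regression line fits")
ax2.scatter(x, y_curved)
ax2.set_title("non-linear: straight line misleading")
plt.tight_layout()
plt.show()
```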
Homoscedasticity
Since in practice the regression model never exactly predicts the dependent variable, there is always an error. This very error must
have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the dependent variable is plotted on the x-axis and the error on the y-axis. The error should then scatter evenly over the entire range. If this is the case, homoscedasticity is present; if not, heteroscedasticity is present. In the case of heteroscedasticity, the error has different variances depending on the value range of the dependent variable.
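A common variant of this check plots the residuals against the predicted values. The sketch below simulates data whose error spread grows with x, so the residuals fan out and heteroscedasticity becomes visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5 * x)   # noise spread increases with x

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
residuals = y - y_hat

# Under homoscedasticity the residuals scatter evenly around zero over the whole range
plt.scatter(y_hat, residuals)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```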
Normal distribution
Another assumption is that the error is normally distributed. This can be checked either with analytical tests (such as the Kolmogorov-Smirnov or Shapiro-Wilk test) or graphically. However, the analytical tests are used less and less, because they tend to attest a normal distribution for small samples and become significant very quickly for large samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore, the graphical variant is increasingly used.
In the graphical variant, either the histogram of the residuals is examined or, even better, the so-called QQ-plot (quantile-quantile plot). The more closely the data lie on the line, the better the normal distribution.
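A QQ-plot of the residuals can be produced, for example, with scipy; the residuals below are simulated stand-ins for the residuals of a fitted regression model.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(0, 2, size=100)   # simulated residuals for illustration

# QQ-plot: sample quantiles of the residuals against theoretical normal quantiles;
# the closer the points lie to the line, the better the normality assumption holds
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```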
Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with
multicollinearity is that the effects of each independent variable cannot be clearly separated from one another.
If, for example, there is a high correlation between x1 and x2, then it is difficult to determine b1 and b2. If, say, both variables are completely identical, the regression model cannot tell how large b1 and how large b2 should be, and the model therefore becomes unstable.
This is of course not tragic if the regression model is only used for prediction; in that case one is only interested in the prediction itself, not in how great the influence of the respective variables is. However, if the regression model is to be used to measure the influence of the independent variables on the dependent variable, there must be no multicollinearity. If multicollinearity exists, the coefficients cannot be interpreted meaningfully.
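A common numerical check for multicollinearity (not described above, but widely used alongside the correlation matrix) is the variance inflation factor. The sketch below computes VIFs with statsmodels for simulated predictors, two of which are almost identical; a frequent rule of thumb treats VIF values above 5 or 10 as problematic.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1 -> strong multicollinearity
x3 = rng.normal(size=200)                   # independent predictor

# Design matrix with an intercept column; variance_inflation_factor expects the full matrix
X = np.column_stack([np.ones(200), x1, x2, x3])

for idx, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, idx):.2f}")
```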
To check whether the coefficient of determination R² differs from zero in the population, an F-test is used. It should be noted, however, that the requirements discussed in the previous section must be met. The F-value is calculated as F = (R² / k) / ((1 - R²) / (n - k - 1)).
The calculated F-value must now be compared with the critical F-value. If the calculated F-value is greater than the critical F-value, the null hypothesis is rejected and R² deviates from zero in the population. The critical F-value can be read from the F-distribution table; the numerator degrees of freedom are k and the denominator degrees of freedom are n - k - 1.
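As a worked illustration, the sketch below computes the F-value from R², n and k and looks up the critical value with scipy instead of a table. R² = 0.754 and k = 3 are taken from the example further down; n = 30 is a made-up sample size.

```python
from scipy import stats

# Hypothetical values: R^2 of the model, n observations, k independent variables
r_squared, n, k = 0.754, 30, 3

# F statistic for H0: R^2 = 0 in the population
f_value = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Critical F-value at the 5% level with k and n-k-1 degrees of freedom
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)

print(f"F = {f_value:.2f}, critical F = {f_crit:.2f}")
print("reject H0" if f_value > f_crit else "do not reject H0")
```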
For each individual regression coefficient, the null hypothesis that it equals zero in the population is tested with the statistic t = bj / sb_j, where bj is the j-th regression coefficient and sb_j is the standard error of bj. This test statistic is t-distributed with n - k - 1 degrees of freedom. The critical t-value can be read from the t-distribution table.
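The same comparison for a single coefficient, as a sketch: b_j = 0.297 corresponds to the age coefficient in the example further down, while its standard error, n and k are made-up values.

```python
from scipy import stats

# Hypothetical values: coefficient, its standard error, n observations, k predictors
b_j, se_bj, n, k = 0.297, 0.11, 30, 3

t_value = b_j / se_bj                     # test statistic for H0: b_j = 0

# Two-sided critical t-value at the 5% level with n-k-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - k - 1)

print(f"t = {t_value:.2f}, critical t = {t_crit:.2f}")
print("reject H0" if abs(t_value) > t_crit else "do not reject H0")
```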
After you have copied your data into the statistics calculator, you must select the variables that are relevant for you. Then you receive
the results in table form.
Interpretation of the results
This table shows that 75.4% of the variation in weight can be explained by height, age and sex. When predicting a person's weight, the model is off by 6.587 on average (the standard estimation error). The regression equation results in:
The equation shows, for example, that if age increases by one year, the weight increases by 0.297 kg according to the model. In the case of the dichotomous variable sex, the coefficient is interpreted as a difference: according to the model, a man weighs 8.922 kg more than a woman. If all independent variables are zero, the result is a weight of -24.41.
The standardized coefficients beta are unit-independent and always range between -1 and +1. The greater the absolute value of beta, the greater the contribution of the respective independent variable to explaining the dependent variable. In this regression analysis, the variable age has the greatest influence on the variable weight.
The calculated coefficients refer to the sample used for the regression analysis, so it is of interest whether the B-values deviate from zero only by chance or whether they also differ from zero in the population. For this purpose, the null hypothesis is formulated that the respective calculated B-value is equal to zero in the population. If this is the case, it means that the respective independent variable has no significant influence on the dependent variable.
The significance value (p-value) indicates whether a variable has a significant influence. Significance values smaller than 0.05 are considered significant. In this example, only age can be considered a significant predictor of a person's weight.