Linear Regression Analysis in Excel
The tutorial explains the basics of regression analysis and shows a few different ways to do linear
regression in Excel.
Imagine this: you are provided with a whole lot of different data and are asked to predict next year's
sales numbers for your company. You have discovered dozens, perhaps even hundreds, of factors that
can possibly affect the numbers. But how do you know which ones are really important? Run regression
analysis in Excel. It will give you an answer to this and many more questions: Which factors matter and
which can be ignored? How closely are these factors related to each other? And how certain can you
be about the predictions?
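In statistical modeling, regression analysis is used to estimate the relationships between two or more
variables. The dependent variable (aka criterion variable) is the main factor you are trying to
understand and predict. Independent variables (aka explanatory variables or predictors) are the factors
that might influence the dependent variable.
Regression analysis helps you understand how the dependent variable changes when one of the
independent variables varies and lets you determine mathematically which of those variables really
has an impact.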
Statisticians distinguish between simple and multiple linear regression. Simple linear
regression models the relationship between a dependent variable and one independent variable
using a linear function. If you use two or more explanatory variables to predict the dependent variable,
you are dealing with multiple linear regression. If the dependent variable is modeled as a non-linear
function because the data relationships do not follow a straight line, use nonlinear regression instead.
This tutorial focuses on simple linear regression.
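As an example, let's take umbrella sales numbers for the last 24 months and find the average
monthly rainfall for the same period. Plot this information on a chart, and the regression line will
show the relationship between the independent variable (rainfall) and the dependent variable
(umbrella sales).
Mathematically, a linear regression is defined by this equation:
y = bx + a + ε
Where:
x is the independent variable.
y is the dependent variable.
a is the Y-intercept, i.e. the expected value of y when x is 0.
b is the slope of the regression line, i.e. the rate at which y changes as x changes.
ε is the random error term, i.e. the difference between the actual value of y and its predicted value.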
The linear regression equation always has an error term because, in real life, predictors are never
perfectly precise. However, some programs, including Excel, do the error term calculation behind the
scenes. So, in Excel, you do linear regression using the least squares method and seek
coefficients a and b such that:
y = bx + a
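To make the least squares idea concrete, here is a minimal Python sketch of how a and b can be computed by hand. The function name least_squares_fit and the five data points are made up for illustration; they are not the tutorial's data set, and Excel's internal implementation may differ.

```python
def least_squares_fit(xs, ys):
    """Return (b, a): slope and intercept of the least squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept
    return b, a


# Hypothetical sample (assumption, not the tutorial's workbook data)
rainfall = [60, 80, 100, 120, 140]   # mm per month
umbrellas = [9, 16, 27, 34, 44]      # units sold

b, a = least_squares_fit(rainfall, umbrellas)
print(f"y = {b:.2f}*x + ({a:.2f})")
```

The slope is the ratio of the co-variation of x and y to the variation of x, and the intercept is chosen so that the line passes through the point of means.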
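For our example, the linear regression equation takes the following shape:
Umbrellas sold = b * rainfall + a
There are a handful of different ways to find a and b. The three main methods to perform linear
regression analysis in Excel are the Regression tool included with the Analysis ToolPak, a scatter
chart with a trendline, and regression formulas.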
Below you will find the detailed instructions on using each method.
How to do linear regression in Excel with Analysis ToolPak
This example shows how to run regression in Excel by using a special tool included with the Analysis
ToolPak add-in.
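With the Analysis ToolPak enabled, carry out these steps to perform regression analysis in Excel:
1. On the Data tab, in the Analysis group, click the Data Analysis button.
2. Select Regression and click OK.
3. In the Regression dialog box, configure the following settings:
o Select the Input Y Range, which is your dependent variable (umbrella sales in our case).
o Select the Input X Range, which is your independent variable (rainfall in our case).
If you are building a multiple regression model, select two or more adjacent columns with
different independent variables.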
o Check the Labels box if there are headers at the top of your X and Y ranges.
o Choose your preferred Output option, a new worksheet in our case.
o Optionally, select the Residuals checkbox to get the difference between the predicted
and actual values.
4. Click OK and observe the regression analysis output created by Excel.
Multiple R. It is the Correlation Coefficient that measures the strength of a linear relationship between
two variables. The correlation coefficient can be any value between -1 and 1, and its absolute
value indicates the relationship strength. The larger the absolute value, the stronger the relationship:
1 means a strong positive relationship, -1 means a strong negative relationship, and 0 means no
relationship at all.
R Square. It is the Coefficient of Determination, which is used as an indicator of the goodness of fit.
In our example, R2 is 0.91 (rounded to 2 digits), which is fairly good. It means that 91% of the
variance in the dependent variable (y-values) is explained by the independent variable (x-values).
As a rule of thumb, an R Square of 95% or more is considered a good fit, although the threshold
depends on the field of study.
Adjusted R Square. It is the R square adjusted for the number of independent variables in the model.
You will want to use this value instead of R square for multiple regression analysis.
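As an illustration of how these two values relate, the following Python sketch computes R Square and Adjusted R Square from actual and predicted values. The function names, the sample data and the fitted line y = 0.44*x - 18 are assumptions made for this example only.

```python
def r_squared(ys, predicted):
    """Share of the variance in ys explained by the model (1 - SS_res / SS_tot)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize R Square for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical sample and its least squares line (assumptions, not the tutorial's data)
rainfall = [60, 80, 100, 120, 140]
umbrellas = [9, 16, 27, 34, 44]
predicted = [0.44 * x - 18 for x in rainfall]

r2 = r_squared(umbrellas, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(umbrellas), k=1)
print(f"R Square = {r2:.4f}, Adjusted R Square = {adj_r2:.4f}")
```

With a single predictor the two numbers stay close; the gap widens as more independent variables are added without improving the fit.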
Standard Error. It is another goodness-of-fit measure that shows the precision of your regression
analysis: the smaller the number, the more certain you can be about your regression equation. While
R2 represents the percentage of the dependent variable's variance that is explained by the model,
Standard Error is an absolute measure that shows the average distance that the data points fall from
the regression line.
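A rough Python sketch of this measure is shown below. It assumes the usual definition, the square root of the residual sum of squares divided by the residual degrees of freedom; the function name and the data are hypothetical.

```python
import math


def standard_error(ys, predicted, k=1):
    """Standard error of the regression with k predictors: sqrt(SS_res / (n - k - 1))."""
    n = len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return math.sqrt(ss_res / (n - k - 1))


# Hypothetical sample and its least squares line (assumptions, not the tutorial's data)
rainfall = [60, 80, 100, 120, 140]
umbrellas = [9, 16, 27, 34, 44]
predicted = [0.44 * x - 18 for x in rainfall]

se = standard_error(umbrellas, predicted)
print(f"Standard Error = {se:.3f} umbrellas")
```

Unlike R Square, the result is expressed in the units of the dependent variable, which makes it easy to compare against the size of the values being predicted.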
The ANOVA (analysis of variance) part of the output contains the following values:
df is the number of the degrees of freedom associated with the sources of variance.
SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the better
your model fits the data.
MS is the mean square.
F is the F statistic, or F-test for the null hypothesis. It is used to test the overall significance of
the model.
Significance F is the P-value of F.
The ANOVA part is rarely used for a simple linear regression analysis in Excel, but you should definitely
have a close look at the last component. The Significance F value gives an idea of how reliable
(statistically significant) your results are. If Significance F is less than 0.05 (5%), your model is OK. If it
is greater than 0.05, you'd probably better choose another independent variable.
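The ANOVA quantities listed above can be reproduced with a short Python sketch. The anova function and the sample data are hypothetical. Significance F is deliberately left out because it requires the F distribution, which is not part of Python's standard library.

```python
def anova(ys, predicted, k=1):
    """Return df, SS, MS and F for a least squares fit with k predictors."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    df_regression, df_residual = k, n - k - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df": (df_regression, df_residual, n - 1),
        # For a least squares line with an intercept, Total SS = Regression SS + Residual SS
        "SS": (ss_regression, ss_residual, ss_regression + ss_residual),
        "MS": (ms_regression, ms_residual),
        "F": ms_regression / ms_residual,
    }


# Hypothetical sample and its least squares line (assumptions, not the tutorial's data)
rainfall = [60, 80, 100, 120, 140]
umbrellas = [9, 16, 27, 34, 44]
predicted = [0.44 * x - 18 for x in rainfall]

table = anova(umbrellas, predicted)
print(f"F = {table['F']:.2f}")
```

A large F value means the regression explains far more variance per degree of freedom than is left in the residuals, which corresponds to a small Significance F.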
The most useful component in this section is Coefficients. It enables you to build a linear regression
equation in Excel:
y = bx + a
For our data set, where y is the number of umbrellas sold and x is the average monthly rainfall, our
linear regression formula goes as follows:
y = Rainfall coefficient * x + Intercept
With the a and b values rounded to three decimal places, it turns into:
y = 0.45*x - 19.074
For example, with the average monthly rainfall equal to 82 mm, the umbrella sales would be
approximately 17.8:
0.45*82 - 19.074 = 17.8
In a similar manner, you can find out how many umbrellas are going to be sold with any other monthly
rainfall (x variable) you specify.
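The same calculation can be sketched in Python. The coefficients 0.45 and -19.074 are the rounded values from the tutorial; the function name predict_sales is made up for this example.

```python
SLOPE = 0.45         # b, the rainfall coefficient (rounded)
INTERCEPT = -19.074  # a, the constant (rounded)


def predict_sales(rainfall_mm):
    """Estimate umbrella sales for a given average monthly rainfall in mm."""
    return SLOPE * rainfall_mm + INTERCEPT


estimate = predict_sales(82)
print(f"Estimated umbrellas sold at 82 mm: {estimate:.1f}")
```

Keep in mind that such estimates are most trustworthy for rainfall values within the range covered by the source data.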
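If you compare the estimated and actual number of umbrellas sold for the monthly rainfall of 82 mm,
you will see that the numbers differ slightly: the estimate is 17.8 (calculated above), while the actual
value in the source data is 15.
Why the difference? Because independent variables are never perfect predictors of the dependent
variable. The residuals help you understand how far the actual values are from the
predicted values: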
For the first data point (rainfall of 82 mm), the residual is approximately -2.8. So, we add this number to
the predicted value, and get the actual value: 17.8 - 2.8 = 15.
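In code, a residual is simply the actual value minus the predicted value. The sketch below uses the tutorial's rounded coefficients and its first data point (82 mm of rainfall, 15 umbrellas sold); the variable names are illustrative.

```python
SLOPE = 0.45         # b, rounded value from the regression output
INTERCEPT = -19.074  # a, rounded value from the regression output

rainfall_mm = 82
actual_sales = 15

predicted_sales = SLOPE * rainfall_mm + INTERCEPT
residual = actual_sales - predicted_sales  # negative: the model overestimates

print(f"Predicted: {predicted_sales:.1f}, residual: {residual:.1f}")
```

Adding the residual back to the prediction recovers the actual value, which is exactly the check performed in the paragraph above.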
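How to make a linear regression graph in Excel
Drawing a linear regression chart in Excel takes only a few steps:
1. Select the two columns with your data, including the headers.
2. On the Insert tab, in the Charts group, click the Scatter chart icon and select the Scatter
thumbnail.
This will insert a scatter plot in your worksheet, which will resemble this one: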
3. Now, we need to draw the least squares regression line. To have it done, right click on any
point and choose Add Trendline… from the context menu.
4. On the right pane, select the Linear trendline shape and, optionally, check Display Equation
on Chart to get your regression formula:
As you may notice, the regression equation Excel has created for us is the same as the linear
regression formula we built based on the Coefficients output.
5. Switch to the Fill & Line tab and customize the line to your liking. For example, you can choose
a different line color and use a solid line instead of a dashed line (select Solid line in the Dash
type box):
At this point, your chart already looks like a decent regression graph:
Important note! In the regression graph, the independent variable should always be on the X axis and
the dependent variable on the Y axis. If your graph is plotted in the reverse order, swap the columns in
your worksheet, and then draw the chart anew. If you are not allowed to rearrange the source data,
then you can switch the X and Y axes directly in a chart.
How to do regression in Excel using formulas
The LINEST function uses the least squares regression method to calculate a straight line that best
explains the relationship between your variables and returns an array describing that line. The
function's syntax is explained in detail in a separate tutorial. For now, let's just make a formula for our
sample dataset:
=LINEST(C2:C25, B2:B25)
Because the LINEST function returns an array of values, you must enter it as an array formula. Select
two adjacent cells in the same row, E2:F2 in our case, type the formula, and press Ctrl + Shift +
Enter to complete it. In Excel 365, which supports dynamic arrays, you can type the formula in E2
alone and press Enter.
The formula returns the b coefficient (E2) and the a constant (F2) for the already familiar linear
regression equation:
y = bx + a
If you avoid using array formulas in your worksheets, you can calculate a and b individually with regular
formulas:
=INTERCEPT(C2:C25, B2:B25)
=SLOPE(C2:C25, B2:B25)
Additionally, you can find the correlation coefficient (Multiple R in the regression analysis summary
output) that indicates how strongly the two variables are related to each other:
=CORREL(B2:B25,C2:C25)
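For readers who want to see what the correlation coefficient measures, here is a minimal Python sketch of the Pearson formula. The function name pearson_r and the sample data are assumptions for illustration, not the workbook's values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical sample (assumption, not the tutorial's workbook data)
rainfall = [60, 80, 100, 120, 140]
umbrellas = [9, 16, 27, 34, 44]

r = pearson_r(rainfall, umbrellas)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")
```

In simple linear regression, squaring this coefficient gives R Square, which is why the two values in the summary output always move together.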
The following screenshot shows all these Excel regression formulas in action:
Tip. If you'd like to get additional statistics for your regression analysis, use the LINEST function with
the stats parameter (the fourth argument) set to TRUE, for instance =LINEST(C2:C25, B2:B25, TRUE, TRUE).
That's how you do linear regression in Excel. That said, please keep in mind that Microsoft Excel is not
a statistical program. If you need to perform regression analysis at the professional level, you may want
to use dedicated software such as XLSTAT, RegressIt, etc.